Computer Vision Unit 3-1
Syllabus content
Unit 3: Facial Recognition with Computer Vision
Satellite-to-Map Image Translation
1. Applications and Techniques
Applications:
o Autonomous driving
Techniques:
o Pix2Pix: A GAN-based model for image-to-image translation.
2. Popular Datasets
Google Maps Dataset: Often used for translating satellite images to Google-style maps. This may involve
scraping or using public APIs to gather pairs of satellite images and map tiles.
DeepGlobe Road Extraction Dataset: Contains satellite images paired with road networks, useful for road
extraction tasks.
Inria Aerial Image Labeling Dataset: Provides aerial imagery paired with building annotations, useful for
building footprint extraction.
SpaceNet Dataset: A publicly available corpus of labeled satellite imagery focusing on road network
extraction and building footprint detection.
3. Challenges
High Variability in Satellite Images: Weather conditions, seasons, and atmospheric conditions can
drastically alter satellite images.
Precision: Translating complex details like roads and buildings while ensuring that small, intricate structures
are correctly represented.
Scale Differences: The resolution of satellite images may vary greatly, requiring models to adapt to different
scales effectively.
4. Use of GANs
Pix2Pix: A conditional GAN designed for paired image-to-image translation. It takes input images (satellite
images) and produces output images (maps).
CycleGAN: Useful for unpaired image translation when paired datasets are unavailable. It translates between
domains (satellite and maps) without requiring direct pixel-to-pixel alignment.
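To make the Pix2Pix objective concrete, here is a minimal sketch of its loss computation in PyTorch. The generator G, discriminator D, and the image tensors are assumptions standing in for a full implementation; the key ideas are that the discriminator is conditioned on the input satellite image, and that the generator loss adds an L1 reconstruction term to the adversarial term, as in the Pix2Pix paper.

```python
# A minimal sketch of the Pix2Pix training losses, assuming G and D
# are user-defined PyTorch modules (D takes the input and output
# images concatenated along the channel axis).
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # adversarial loss on raw logits
l1 = nn.L1Loss()              # pixel-level reconstruction loss
LAMBDA_L1 = 100.0             # L1 weight used in the Pix2Pix paper

def pix2pix_losses(G, D, satellite, map_gt):
    fake_map = G(satellite)

    # Discriminator: real (input, output) pairs should score 1, fakes 0
    d_real = D(torch.cat([satellite, map_gt], dim=1))
    d_fake = D(torch.cat([satellite, fake_map.detach()], dim=1))
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))

    # Generator: fool the discriminator and stay close to ground truth
    d_on_fake = D(torch.cat([satellite, fake_map], dim=1))
    g_loss = bce(d_on_fake, torch.ones_like(d_on_fake)) + \
             LAMBDA_L1 * l1(fake_map, map_gt)
    return d_loss, g_loss
```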
Facial recognition is a type of computer vision that uses optical input to analyze images and identify
faces. It's a form of artificial intelligence (AI) that mimics the human ability to recognize faces. Facial
recognition software uses AI, image recognition, and other advanced technologies to map, analyze, and
confirm a face's identity.
Detection is the process of finding a face in an image. Enabled by computer vision, facial recognition can
detect and identify individual faces from an image containing one or many people's faces.
Facial recognition is a system used to identify a person by analyzing the individual's facial features, and the
term also refers to the software that automates the process. It scans the person's face, notes key characteristics,
and compares them to images stored in a database. If the images match, the system confirms the identity.
Two broad categories used to classify facial recognition software are holistic and feature-based:
Holistic models examine your entire face and compare your features to those in images stored in a database.
A feature-based model analyzes your face in more depth, for example by considering measurements between
features and the contours of bones.
Note that detection and recognition are different tasks. Face detection is the part of face recognition that
determines the number of faces in a picture or video, without remembering or storing identifying details. It may
estimate demographic data such as age or gender, but it cannot recognize individuals.
Face recognition identifies a face in a photo or video image by matching it against a pre-existing database of
faces. Faces first need to be enrolled into the system to create that database of unique facial features. Afterward,
the system breaks a new image down into key features and compares them against the information stored in the
database.
Facial recognition software typically follows a three-step process: detection, analysis, and recognition.
Detect: In the first step, the program searches through an image looking for facial data. It views faces
from the front and side, looking for distinctive features to analyze in the next step.
Analyze: After identifying a face in an image, the program examines facial landmarks like the distance
from the chin to the forehead and between the eyes. It also considers the shape of different features like
the cheekbones, lips, ears, and more.
Recognize: In the final step of the process, the facial recognition program applies what it's learned
from the data to verify an individual's identity. It may compare the current image under analysis with a
stored image like one used on a government ID.
Face detection uses machine learning (ML) and artificial neural network (ANN) technology, and plays an
important role in face tracking, face analysis and facial recognition. In face analysis, face detection uses facial
expressions to identify which parts of an image or video should be focused on to determine age, gender and
emotions. In a facial recognition system, face detection data is required to generate a faceprint and match it
with other stored faceprints.
Face detection algorithms typically start by searching for human eyes, one of the easiest features to detect.
They then try to detect facial landmarks, such as eyebrows, mouth, nose, nostrils and irises. Once the
algorithm concludes that it has found a facial region, it does additional tests to confirm that it has detected a
face.
To ensure accuracy, the algorithms are trained on large data sets that incorporate hundreds of thousands of
positive and negative images. The training improves the algorithms' ability to determine whether there are
faces in an image and where they are.
Face detection software detects faces by identifying facial features in a photo or video using machine
learning algorithms. It first looks for an eye, and from there it identifies other facial features. It then
compares these features to training data to confirm it has detected a face.
First, the computer examines either a photo or a video image and tries to distinguish faces from any other
objects in the background. There are methods that a computer can use to achieve this, compensating for
illumination, orientation, or camera distance. Yang, Kriegman, and Ahuja presented a classification for face
detection methods. These methods are divided into four categories, and a face detection algorithm can belong
to more than one category.
Face detection software uses several different methods, each with advantages and disadvantages:
Knowledge- or rule-based. These approaches describe a face based on rules. Establishing well-defined,
knowledge-based rules can be a challenge, however.
Feature-based or feature-invariant. These methods use features such as a person's eyes or nose to detect a
face. They can be negatively affected by noise and light.
Template matching. This method is based on comparing images with previously stored standard face
patterns or features and correlating the two to detect a face. However, this approach struggles to address
variations in pose, scale and shape.
Appearance-based: This method uses statistical analysis and ML to find the relevant characteristics of face
images. The appearance-based method can struggle with changes in lighting and orientation.
Face detection algorithms are a key component of computer vision and are used to identify
and locate human faces in digital images or videos. These algorithms are the foundation for a
wide range of applications, including facial recognition, emotion detection, and security
systems. Here's an overview of some common face detection algorithms:
1. Haar Cascades
- Approach: Haar Cascades use machine learning to train a cascade function from a large
number of positive and negative images. The algorithm then detects faces by scanning the
image at different scales and positions.
- Cons: Can be less accurate, particularly with faces at different angles or under varied
lighting conditions.
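As a quick illustration, here is a minimal sketch of Haar-cascade face detection using OpenCV, which ships the pre-trained frontal-face cascade with the library; the input filename is a placeholder.

```python
# Minimal Haar-cascade face detection sketch with OpenCV
# (pip install opencv-python). "group_photo.jpg" is a placeholder.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("group_photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Scan the image at multiple scales and positions
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detected.jpg", img)
```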
2. Histogram of Oriented Gradients (HOG)
- Approach: HOG detects faces by capturing the structure of the human face using
gradients in the image. It divides the image into small regions and computes the gradient
orientation for each region.
- Cons: Can be computationally expensive and may struggle with complex backgrounds.
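dlib's built-in frontal face detector is a widely used HOG implementation (HOG features plus a linear SVM). A minimal sketch, with a placeholder image path:

```python
# Minimal HOG-based face detection sketch using dlib's built-in
# detector (HOG features + linear SVM). "photo.jpg" is a placeholder.
import dlib

detector = dlib.get_frontal_face_detector()
img = dlib.load_rgb_image("photo.jpg")

# The second argument upsamples the image once to help find smaller faces
for rect in detector(img, 1):
    print(rect.left(), rect.top(), rect.right(), rect.bottom())
```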
3. Convolutional Neural Networks (CNNs)
- Approach: CNN-based face detection algorithms use deep learning techniques to learn
features from a large set of labeled face images. These models are trained on large datasets
and can achieve high accuracy in detecting faces, even in challenging conditions.
- Pros: Highly accurate, able to detect faces in various poses and lighting conditions.
- Cons: Requires a large amount of computational power and data for training.
4. YOLO (You Only Look Once)
- Approach: YOLO is a real-time object detection system that divides an image into a grid
and predicts bounding boxes and class probabilities for each grid cell. It is often used for fast
face detection in videos.
- Cons: May miss small faces in the image due to its grid-based approach.
5. SSD (Single Shot MultiBox Detector)
- Approach: SSD is another deep learning-based object detection method that can detect
multiple objects, including faces, in an image in a single shot. It uses a series of convolutional
layers to predict the bounding boxes and classes.
- Pros: Balances speed and accuracy, works well for real-time detection.
- Cons: May be less accurate than other deep learning models like Faster R-CNN.
6. Facial Landmark Detection
- Approach: This technique involves detecting key points on a face, such as the eyes, nose,
and mouth, and then using these landmarks to identify the face. Algorithms like Dlib's 68-
point facial landmark detector are commonly used.
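A minimal sketch with dlib's 68-point predictor follows. The model file must be downloaded separately from dlib.net, so the local path here is an assumption.

```python
# Minimal 68-point facial landmark detection sketch with dlib.
# The .dat model file is downloaded separately; its path is an assumption.
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = dlib.load_rgb_image("face.jpg")  # placeholder image
for rect in detector(img, 1):
    shape = predictor(img, rect)
    nose_tip = shape.part(30)  # index 30 is the nose tip in the 68-point scheme
    print(f"Nose tip at ({nose_tip.x}, {nose_tip.y})")
```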
7. Viola-Jones Algorithm
- Approach: This is one of the earliest and most well-known face detection algorithms,
using a combination of simple rectangular features, integral images, and a cascaded classifier
to detect faces.
- Cons: Can struggle with faces that are not frontal or have complex backgrounds.
These algorithms form the basis for more advanced tasks like facial recognition, expression
analysis, and other biometric applications.
Face Detection Implementation
Test Photograph
a "test photograph" refers to an image used for evaluating the performance of an algorithm or model. These
images are essential in the development and testing of computer vision systems, such as object detection,
face recognition, image classification, and more. Here's how test photographs are typically used in computer
vision:
1. Model Evaluation
Purpose: Test photographs are used to assess how well a computer vision model performs on unseen
data. After training a model on a set of images (training set), the test photographs (test set) are used
to evaluate its accuracy, precision, recall, and other performance metrics.
Example: After training a face detection algorithm, a set of test photographs containing faces in
various poses, lighting conditions, and backgrounds is used to evaluate how accurately the model
detects faces.
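For instance, once the model's predictions on the test photographs are collected, the standard metrics can be computed with scikit-learn; the label arrays below are made-up placeholders.

```python
# Minimal sketch of test-set evaluation with scikit-learn.
# y_true / y_pred are placeholder labels and predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]  # ground truth: face present (1) or absent (0)
y_pred = [1, 0, 1, 0, 0, 1]  # hypothetical model predictions

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
```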
2. Algorithm Benchmarking
Purpose: Test photographs are used to benchmark different algorithms by providing a standardized
set of images for comparison. This helps in determining which algorithm performs best under
specific conditions.
Example: A researcher might use the same test photographs to compare the performance of different
object detection algorithms (e.g., YOLO vs. SSD) on identifying objects in an image.
3. Generalization Testing
Purpose: Test photographs help determine how well a model generalizes to new data. A model that
performs well on the training data might not perform well on unseen test photographs if it has
overfitted to the training set.
Example: A model trained to classify dog breeds might be tested on photographs of dog breeds not
included in the training set to see how well it generalizes.
4. Cross-Domain Testing
Purpose: Test photographs from different domains (e.g., different environments, cultures, or types
of images) are used to evaluate how robust a computer vision model is across various contexts.
Example: A face recognition system might be tested on photographs from different countries to
ensure it performs well across different ethnicities and facial features.
5. Dataset Evaluation
Purpose: Large test datasets composed of test photographs are often used to evaluate the overall
effectiveness of computer vision systems. Popular datasets like ImageNet, COCO (Common Objects
in Context), or MNIST are widely used benchmarks in the field.
Example: The COCO dataset contains a large set of test photographs with labeled objects, which are
used to evaluate object detection algorithms.
6. Error Analysis
Purpose: Test photographs are used to analyze and understand the types of errors a computer vision
model makes. This can help in improving the model by focusing on its weaknesses.
Example: If a model frequently misclassifies certain objects, test photographs showing those objects
can be used to investigate why the model is failing and how it can be improved.
7. Edge Case Testing
Purpose: Test photographs that represent edge cases or challenging scenarios are used to see how
well a model handles difficult situations, such as occlusions, low lighting, or unusual perspectives.
Example: A self-driving car system might be tested using photographs of pedestrians partially
obscured by objects to see how well it detects them.
8. Real-World Testing
Purpose: Test photographs taken in real-world environments are used to evaluate how a computer
vision model performs outside of controlled lab conditions.
Example: A model developed for surveillance might be tested using photographs from real CCTV
footage to see how well it identifies people in varying conditions.
Characteristics of a good test set:
Diversity: Test photographs should cover a wide range of scenarios to ensure the model is robust.
Realism: Test images should resemble the real-world data the model will encounter in deployment.
Balance: The test set should be balanced in terms of classes and conditions to provide a fair
evaluation.
Unseen Data: Ideally, test photographs should not be part of the training dataset to ensure an
unbiased evaluation.
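In practice, the "unseen data" requirement is usually enforced by splitting the dataset before training. A minimal sketch with scikit-learn, using random placeholder arrays in place of a real dataset:

```python
# Minimal sketch of reserving unseen, balanced test photographs.
# `images` and `labels` are random placeholders for a real dataset.
import numpy as np
from sklearn.model_selection import train_test_split

images = np.random.rand(100, 64, 64, 3)
labels = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    images, labels,
    test_size=0.2,     # 20% held out as unseen test photographs
    stratify=labels,   # keep class balance in both splits
    random_state=42)   # reproducible split
```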
OpenCV is one of the most popular and widely-used libraries for computer vision tasks.
However, there are several other libraries and frameworks available that offer alternatives to
OpenCV, each with its own set of features, strengths, and weaknesses.
TensorFlow
TensorFlow, developed by Google, is primarily known as a deep learning framework.
However, it also provides a comprehensive set of tools and APIs for computer vision tasks
through its TensorFlow Image Processing (TF Image) module. TensorFlow offers high-level
abstractions for building and training deep neural networks for image classification, object
detection, segmentation, and more.
PyTorch
PyTorch, developed by Facebook, is another popular deep learning framework widely used in
the research community. PyTorch offers a flexible and intuitive interface for building custom
neural networks for various computer vision tasks. It provides dynamic computation graphs,
making it easy to experiment with different network architectures and algorithms.
scikit-image
scikit-image is a Python library specifically designed for image processing tasks. It provides
a collection of algorithms and functions for image filtering, feature extraction, segmentation,
and more. scikit-image is built on top of NumPy, making it easy to integrate with other
scientific computing libraries in the Python ecosystem.
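A minimal sketch of these capabilities, using a sample image bundled with the library:

```python
# Minimal scikit-image sketch: filtering, edge detection, segmentation.
from skimage import data, feature, filters

img = data.camera()                        # sample grayscale image
smoothed = filters.gaussian(img, sigma=2)  # Gaussian filtering
edges = feature.canny(img, sigma=2)        # Canny edge detection
mask = img > filters.threshold_otsu(img)   # Otsu threshold segmentation
```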
Dlib
Dlib is a C++ library that offers a wide range of tools and algorithms for machine learning,
computer vision, and image processing. It is known for its robust implementation of facial
landmark detection, object tracking, and facial recognition algorithms. Dlib also provides
Python bindings for easy integration into Python projects.
SimpleCV
SimpleCV is a Python framework designed to make computer vision tasks accessible to
beginners and non-experts. It provides a high-level interface for common computer vision
tasks, such as image acquisition, processing, feature extraction, and object detection.
SimpleCV abstracts away much of the complexity involved in computer vision, making it
suitable for rapid prototyping and experimentation.
Caffe
Caffe is a deep learning framework developed by Berkeley AI Research (BAIR). While it is
primarily focused on deep learning tasks, Caffe also includes modules for computer vision
tasks such as image classification, object detection, and segmentation. Caffe is known for its
speed and efficiency, particularly in training large-scale convolutional neural networks
(CNNs).
MXNet
MXNet is a deep learning framework that offers support for both symbolic and imperative
programming models. It provides a comprehensive set of tools and APIs for building and
deploying deep learning models for computer vision tasks. MXNet’s flexibility and
scalability make it suitable for both research and production environments.
Face detection algorithms are specialized computer vision algorithms designed to identify and locate human
faces within images or videos. These algorithms are foundational for various applications, including facial
recognition, emotion detection, security systems, and human-computer interaction. Here’s an overview of
some commonly used face detection algorithms:
1. Haar Cascades
Method: A machine learning approach that trains a cascade of classifiers on simple rectangular
(Haar-like) features from large numbers of positive and negative images, then scans the image at
different scales and positions to find faces.
2. Histogram of Oriented Gradients (HOG)
Method: Captures the structure of the face by computing gradient orientations over small image
regions, typically combined with a linear classifier such as an SVM.
3. Convolutional Neural Networks (CNNs)
Method: CNNs are deep learning models that automatically learn features from large datasets of
images. These models are particularly effective for face detection due to their ability to capture
complex patterns in the data.
Examples:
o MTCNN (Multi-Task Cascaded Convolutional Networks): Detects faces and facial
landmarks in a multi-stage process.
o RetinaFace: A state-of-the-art face detector that provides high accuracy by combining face
detection with keypoint localization.
Advantages:
o High accuracy, especially in detecting faces under varying poses and lighting conditions.
o Can detect small faces and faces in challenging conditions.
Disadvantages:
o Requires significant computational resources and large datasets for training.
o More complex to implement and fine-tune compared to traditional methods.
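A minimal sketch using the mtcnn package, one implementation of the MTCNN detector mentioned above; the image path is a placeholder.

```python
# Minimal CNN-based face detection sketch with the mtcnn package
# (pip install mtcnn). "crowd.jpg" is a placeholder image.
import cv2
from mtcnn import MTCNN

detector = MTCNN()
img = cv2.cvtColor(cv2.imread("crowd.jpg"), cv2.COLOR_BGR2RGB)

# Each result holds a bounding box, a confidence score, and five keypoints
for face in detector.detect_faces(img):
    print(face["box"], face["confidence"])
```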
4. YOLO (You Only Look Once)
Method: YOLO is a real-time object detection system that divides an image into a grid and predicts
bounding boxes and class probabilities directly from the full images in a single evaluation.
Advantages:
o Extremely fast, making it suitable for real-time applications.
o Can detect multiple faces and objects in an image simultaneously.
Disadvantages:
o May miss smaller faces due to its grid-based approach.
o Less accurate compared to some other deep learning-based models in certain scenarios.
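A minimal sketch with the ultralytics package follows. Note the stock yolov8n.pt weights detect generic COCO classes (including "person") rather than faces specifically; face-specific YOLO weights exist but would be a separate download, so this is a general-detection sketch only.

```python
# Minimal YOLO detection sketch with the ultralytics package
# (pip install ultralytics). "street_scene.jpg" is a placeholder.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")           # pre-trained general-purpose weights
results = model("street_scene.jpg")  # runs detection in a single pass

for r in results:
    for box in r.boxes:
        print(box.xyxy, box.conf, box.cls)  # box, confidence, class id
```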
5. SSD (Single Shot MultiBox Detector)
Method: SSD is a deep learning-based object detection model that predicts bounding boxes and
object classes in a single pass through the network. It uses a series of convolutional layers to detect
faces at multiple scales.
Advantages:
o Balanced trade-off between speed and accuracy.
o Suitable for detecting faces in real-time.
Disadvantages:
o May not be as accurate as models like Faster R-CNN for complex images.
o Can struggle with very small or very large faces.
6. Faster R-CNN
Method: Faster R-CNN is an advanced deep learning model that uses a Region Proposal Network
(RPN) to propose candidate object regions, followed by a classifier that refines these regions and
classifies them.
Advantages:
o High accuracy in detecting faces, even in challenging conditions.
o Effective in detecting small faces and faces with occlusions.
Disadvantages:
o Computationally intensive, making it less suitable for real-time applications without
specialized hardware.
o Slower compared to models like YOLO and SSD.
7. Facial Landmark Detection
Method: Instead of detecting the entire face, facial landmark detection algorithms identify key
points on the face, such as the eyes, nose, mouth, and chin. These landmarks can then be used to
infer the presence and orientation of a face.
Advantages:
o Provides detailed information about face orientation and expression.
o Useful for applications like face alignment and emotion recognition.
Disadvantages:
o More computationally intensive than simple face detection.
o Requires accurate landmark localization, which can be challenging in some conditions.
Applications of face detection:
Security and Surveillance: Used in security cameras and access control systems to detect and track
individuals.
Social Media: Automatic tagging and photo organization.
Healthcare: Monitoring patient conditions and expressions.
Automotive: Driver monitoring systems in vehicles to detect drowsiness or distraction.
Retail: Customer behavior analysis and targeted advertising.
Object Detection
What is object detection?
Object detection is a computer vision technique that uses machine learning or deep learning to
locate and classify objects in images or videos. The goal is to develop computational models
that can answer the fundamental question, "What objects are where?"
Object detection is a technique that uses neural networks to localize and classify objects in images. This
computer vision task has a wide range of applications, from medical imaging to self-driving cars.
Object detection can be used in many areas, including: Medical imaging, Self-driving cars,
Image retrieval, Video surveillance, and Food manufacturing.
To train an object detection model, you need to create a neural network and show it images of an
object in different scenarios. You then label the object and its location.
Put more simply, the process works like this:
Finding Clues: The computer looks for clues like shapes, colors, and patterns in the picture.
Guessing What’s There: Based on those clues, it makes guesses about what might be in the picture.
Checking the Guesses: It checks each guess by comparing it to things it already knows.
Drawing Boxes: If it’s pretty sure about something, it draws a box around it to show where it thinks the
object is.
Making Sure: Finally, it double-checks its guesses to make sure it got things right and fixes any mistakes.
A typical training workflow (for example, with the TensorFlow Object Detection API) looks like this:
Collect and label images: Gather images of the target objects and annotate each object's class and
bounding-box location.
Convert to TFRecord format: If using TensorFlow, convert the annotations to TFRecord format.
Create a configuration file: Configure the training process (model architecture, learning rate, number
of training steps).
Train the model: Run the training job until performance stops improving.
Export the model: Save the model for use in other applications.
Tools and resources
TensorFlow Object Detection API: A popular framework for training object detectors.
Detecto: A library that allows training a model with just a few lines of code (see the sketch after this list).
TensorFlow Lite Model Maker: A tool for training object detection models for edge devices.
YOLOv7: A real-time detector family with step-by-step guides for training a custom object detector.
Custom Vision Service: A Microsoft Azure service for building object detectors.
MATLAB: A toolbox that allows training custom object detectors.
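As an illustration of how little code some of these tools require, here is a minimal sketch with Detecto. The directory name, the annotation format (Pascal-VOC XML files alongside the images), and the "face" label are assumptions.

```python
# Minimal custom object-detector training sketch with Detecto
# (pip install detecto). Paths and the "face" label are placeholders.
from detecto import core, utils

dataset = core.Dataset("train_images/")  # images + Pascal-VOC XML annotations
model = core.Model(["face"])             # classes to detect
model.fit(dataset)                       # fine-tunes a pre-trained detector

labels, boxes, scores = model.predict(utils.read_image("test.jpg"))
model.save("face_detector.pth")          # export for other applications
```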
What is OCR?
Optical character recognition (OCR) is a technology that uses automated data extraction to quickly convert
images of text into a machine-readable format.
OCR is sometimes referred to as text recognition. An OCR program extracts and repurposes data from
scanned documents, camera images and image-only PDFs. OCR software singles out letters on the image,
puts them into words, and then puts the words into sentences, thus enabling access to and editing of the
original content. It also eliminates the wasted effort of redundant manual data entry.
OCR systems use a combination of hardware and software to convert physical, printed documents into
machine-readable text. Hardware, such as an optical scanner or specialized circuit board, copies or reads
text, then software typically handles the advanced processing.
OCR software can take advantage of artificial intelligence (AI) to implement more advanced methods of
intelligent character recognition (ICR) for identifying languages or handwriting. Organizations often use the
process of OCR to turn printed legal or historical documents into PDF documents so that users can edit,
format and search the documents as if created with a word processor.
How does OCR work?
Image acquisition: All document pages are copied and then the OCR engine converts the digital document
into a two-color or black-and-white version. The scanned-in image or bitmap is analyzed for light and dark
portions. The program then identifies the dark portions as characters that need to be recognized, while light
areas are identified as background.
Preprocessing: The digital image is cleaned to remove extraneous pixels. This preprocessing can include
deskewing to correct for the image being improperly aligned during scanning, removing graphic rules and
boxes that were part of the printed image and determining whether script text is included.
Text recognition: The dark portions are processed to find alphabetic letters, numeric digits or symbols. This
stage typically involves targeting one character, word or block of text at a time. Characters are then
identified by using one of two algorithms, either pattern recognition or feature recognition.
Pattern recognition (or pattern matching): The OCR program has previously been trained on
examples of text in various fonts and formats to recognize characters by comparison to a template in
the scanned document or image file. Each unique combination of shape, scale and font is called a
glyph. For this to work, the characters must be in a font that the OCR program has already been
trained on. Given the number of fonts worldwide and languages that use different
characters, such as Arabic, Chinese, English, French, German, Greek,
Japanese, Korean or Spanish, training on every combination of font and language would be an
enormous system drain.
Feature recognition (detection or extraction): This is used when the OCR program is analyzing a font
that it has not been trained on. OCR applies rules regarding the features of a specific letter or number
to recognize characters in the scanned document. Features include the number of angled lines, line
intersections, loops or curves in a character. For example, the capital letter “A” is stored as two
diagonal lines that meet with a horizontal line across the middle. When a character is identified, it is
converted into an American Standard Code for Information Interchange (ASCII) code that computer
systems use to handle further manipulations.
Layout recognition: A more complete OCR program will also analyze the structure of a document image. It
divides the page into elements, such as blocks of text, tables or images. The lines are divided into words and
then into characters. After the characters have been singled out, the program compares them with a set of
pattern images. After processing all likely matches, the program returns the recognized text.
Postprocessing: The gathered information is stored as a digital file, either in an editable form or PDF.
Some systems retain both the input image and the post-OCR versions for easier comparison and more
complete document management.
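The whole pipeline above is wrapped up by off-the-shelf OCR engines. A minimal sketch with pytesseract, a Python wrapper around the Tesseract engine (the Tesseract binary itself must be installed separately); the input path is a placeholder.

```python
# Minimal OCR sketch with pytesseract (pip install pytesseract pillow;
# the Tesseract engine is a separate system install).
from PIL import Image
import pytesseract

img = Image.open("scanned_page.png")     # placeholder scanned image
text = pytesseract.image_to_string(img)  # detection + recognition in one call
print(text)
```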
Types of OCR
There are 4 types of OCR programs, with increasing sophistication:
Simple OCR: Analysis is character-by-character pattern-matching, comparing scanned characters to the
stored glyphs. With so many potential font and language combinations, the types of documents that can be
analyzed are limited.
Optical mark recognition (OMR): For identifying checked boxes and other marks, such as bubbles in
surveys or a signature on a form, plus logos, symbols and watermarks. All can be identified by matching to
stored images, as with simple OCR.
Intelligent character recognition (ICR): As mentioned previously, ICR brings in the power of AI. By
using ML or deep learning, the OCR program learns to read just as humans do: through continual practice
and training. A neural network reviews text repeatedly looking for distinctive attributes: the locations of
curves, intersections, lines and loops.
Intelligent word recognition: This is the natural evolution of the previous ICR recognition, but now the AI
has been trained to recognize a word in a single image, making it ultimately faster.
The benefits of OCR
The benefits of employing OCR technology include the ability to:
Cut costs by reducing or eliminating redundant manual input.
Streamline workflows with the input of preprinted documents or written forms and speed research
with searchable digital data.
Automate document routing, content processing and preparation for text mining.
Centralize and secure data sets for protection against fires, break-ins and documents lost in the bank
vaults.
Enable greater access to data for visually impaired staff and customers.
Improve service by giving employees the most up-to-date and accurate information.