ML Study Design - Google Street View Blurring System

Objective:

- Is there an existing system already in place?
- What is the main objective: accuracy, latency?
- Is there labeled data? If so, how much?
- Can we trust the quality of the incoming data?
- Where will the model be deployed (cloud, edge, etc.)?
- Are there any constraints or limitations on model complexity or size?
- What are the key success metrics for this ML system in the long term?

In a given image, objects are first located, and then bounding boxes are drawn around each object.

What is an RPN?

A Region Proposal Network (RPN) is a critical component in object detection systems like
Faster R-CNN. Its role is to generate proposals—regions in an image that are likely to contain
objects—quickly and accurately.

How Does It Work?


1. Feature Extraction:
o The RPN uses a feature map produced by a convolutional neural network (CNN)
as input.
o This feature map captures essential details about the image, such as edges,
textures, and patterns.
2. Anchor Boxes:
o At each position in the feature map, RPN places predefined anchor boxes—
rectangles of various sizes and aspect ratios.
o These boxes act as potential candidates for object regions.
o Example: For a 10x10 feature map with 3 scales and 3 aspect ratios, RPN creates 10 × 10 × 9 = 900 anchor boxes (see the sketch after this list).
3. Objectness and Bounding Box Refinement:
o For each anchor box, RPN predicts:
 Objectness Score: How likely it is that the box contains an object (vs.
background).
 Bounding Box Adjustment: Precise shifts to the box’s position, width,
and height to better match the object.
o Example: An anchor box at (5, 5) with size 32x32 pixels might be refined to:
 Objectness score: 0.9 (high confidence)
 Adjusted box: Center (5.2, 5.1), Size 30x40 pixels.
4. Proposal Generation:
o Filtering: Anchor boxes with low objectness scores are discarded.
o Refinement: The remaining boxes are adjusted using the bounding box
predictions. An anchor box might be at (x=100, y=150) with width 50 and height
30. The RPN might predict dx=2, dy=-1, dw=4, dh=-2. The refined box would
then be at (x=102, y=149) with width 54 and height 28.
o Non-Maximum Suppression (NMS): Overlapping proposals are merged, leaving
only the highest-quality ones.
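
As a rough illustration of step 2 above, this Python sketch enumerates anchors for the 10x10 example; the stride, scales, and aspect ratios are illustrative assumptions, not values from any particular implementation.

```python
import numpy as np

# Sketch of anchor enumeration for a 10x10 feature map with 3 scales and
# 3 aspect ratios; stride/scale/ratio values are illustrative assumptions.
def generate_anchors(feat_h=10, feat_w=10, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Map the feature-map cell back to a center in image coordinates.
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Vary the shape while keeping the anchor area ~ scale^2.
                    w = scale * (ratio ** 0.5)
                    h = scale / (ratio ** 0.5)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

print(generate_anchors().shape)  # (900, 4): 10 x 10 x 9 anchors, as above
```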

How NMS works:

1. Sort the remaining proposals by their objectness scores in descending order.
2. Select the proposal with the highest score.
3. Compare this proposal with all other remaining proposals.
4. If the Intersection over Union (IoU) between the selected proposal and any other proposal is greater than a certain threshold (e.g., 0.7), discard the other proposal (because it's considered redundant).
5. Repeat steps 2-4 until all proposals have been considered.

 Intersection over Union (IoU): IoU is the ratio of the area of overlap between two bounding boxes to the area of their union. A high IoU means the boxes overlap significantly.
 Effect: NMS ensures that each object is represented by only one (or a few) high-quality proposal(s), further reducing the number of proposals passed to the next stage of the object detection pipeline.
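
A minimal Python sketch of the IoU computation and the greedy NMS loop just described; the (x1, y1, x2, y2) box format and the 0.7 threshold are illustrative.

```python
import numpy as np

def iou(box_a, box_b):
    # IoU: area of overlap divided by the area of union.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.7):
    # Step 1: sort proposals by objectness score, descending.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)   # Step 2: take the highest-scoring proposal.
        keep.append(best)
        # Steps 3-4: discard remaining proposals that overlap it too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep               # indices of the surviving proposals
```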
Why is RPN Important?

 RPN enables end-to-end training, allowing the model to jointly optimize region
proposals and object classification.
 It’s computationally efficient, as it operates directly on the feature map without requiring
separate sliding window or region extraction steps.

YOLO

YOLO (You Only Look Once) is a single-stage object detection system that performs detection
directly on a grid overlaid on the input image. Here's a concise breakdown:

1. Grid Division: The input image is divided into an S x S grid (e.g., 7x7).
2. Cell Predictions: Each grid cell is responsible for predicting:
o B bounding boxes: Each bounding box has:
 (x, y): Center coordinates relative to the cell.
 (w, h): Width and height relative to the image.
 Confidence score: Probability of an object being in the box and the box
being accurate.
o C class probabilities: A probability distribution over the C object classes.
3. Encoding: These predictions are encoded into a tensor of size S x S x [B * 5 + C], where
5 represents the 4 bounding box coordinates + the confidence score (see the decoding sketch after this list).
4. Non-Max Suppression (NMS): Because multiple cells might detect the same object,
NMS is used to filter out redundant bounding boxes, keeping only the ones with the
highest confidence scores and sufficient IoU (Intersection over Union).
5. Loss Function: YOLO uses a loss function that combines:
o Bounding box regression loss (how well the predicted boxes match the ground
truth).
o Confidence loss (how accurate the objectness predictions are).
o Classification loss (how accurate the class predictions are).
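
A small sketch of how such a tensor can be sliced apart, using the original YOLOv1 sizes (S=7, B=2, C=20) and random numbers standing in for a real network output:

```python
import numpy as np

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes
pred = np.random.rand(S, S, B * 5 + C)  # stand-in for the network's output

cell = pred[3, 4]                   # predictions for one grid cell
boxes = cell[:B * 5].reshape(B, 5)  # each row: (x, y, w, h, confidence)
class_probs = cell[B * 5:]          # probability distribution over C classes

# A box's class-specific score is its confidence times the class probability;
# these scores are what NMS later thresholds and compares.
scores = boxes[:, 4:5] * class_probs  # shape (B, C)
```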

Key Differences from Two-Stage Detectors:

 No Region Proposals: YOLO directly predicts bounding boxes and class probabilities
without a separate region proposal step. This is what makes it much faster.
 Grid-Based Detection: Detection happens at the grid cell level. Each cell is responsible
for predicting objects whose centers fall within it.
Some other models
Example of per-class evaluation and mAP for a two-stage detector:

Two classes ("cat", "dog") and two images are used for a simplified example.

 Image 1: 2 cats, 1 dog (ground truth). Predictions: 3 cat predictions (2 correct), 1 correct
dog prediction.
 Image 2: 1 cat, 2 dogs (ground truth). Predictions: 1 correct cat prediction, 3 dog
predictions (2 correct).

Using precision and recall (simplified; in practice AP is computed from a full precision-recall curve):

 Cat: Image 1: P = 2/3 ≈ 0.67, R = 1.0. Image 2: P = 1.0, R = 1.0. AP (simplified average): ≈ 0.83
 Dog: Image 1: P = 1.0, R = 1.0. Image 2: P = 2/3 ≈ 0.67, R = 1.0. AP (simplified average): ≈ 0.83

mAP (average of per-class APs): (0.83 + 0.83) / 2 ≈ 0.83
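
A short Python sketch reproducing the simplified computation above; recall is 1.0 for every class and image here, so only the precisions enter the average.

```python
# Per image and class: (num_predictions, num_correct, num_ground_truth),
# taken from the two-image example above.
counts = {
    "cat": [(3, 2, 2), (1, 1, 1)],
    "dog": [(1, 1, 1), (3, 2, 2)],
}

aps = {}
for cls, images in counts.items():
    # Simplified AP: average the per-image precisions (real AP integrates
    # a precision-recall curve over ranked detections).
    precisions = [correct / preds for preds, correct, _ in images]
    aps[cls] = sum(precisions) / len(precisions)

mean_ap = sum(aps.values()) / len(aps)
print(aps, mean_ap)  # {'cat': 0.833..., 'dog': 0.833...} 0.833...
```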


The "Hard negative mining" component likely processes the initial training results to identify and
select the most challenging false positives. These hard negatives are then used to augment the
dataset for subsequent training iterations.
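
One way to read that component, sketched below; `background_regions` and `score_fn` are hypothetical placeholders, not names from the presentation.

```python
# Hard negative mining sketch: keep the background regions the current model
# most confidently (and wrongly) scores as containing an object, and add
# them back into the training set. All names here are hypothetical.
def mine_hard_negatives(background_regions, score_fn, top_k=100):
    scored = [(score_fn(region), region) for region in background_regions]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # worst offenders first
    return [region for _, region in scored[:top_k]]
```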
Questions:

Dr. Mamta's Question:


●Question: Is there a way to understand the data and what kind of data augmentation can be
applied without misunderstanding the data, especially when there is limited data and human
intervention is undesirable?
●Answer: Jit acknowledged that data augmentation needs careful consideration. He explained
that in the Street View example, using completely upside-down images would be inappropriate
because they wouldn't occur in reality. He suggested focusing on realistic scenarios and data that
would likely be encountered. Prasa and Dr. Mamta highlighted the importance of human
intervention, particularly with limited datasets, to ensure appropriate augmentation and avoid
inaccurate results. They emphasized the need for a "human in the middle" to evaluate and select
suitable data for training to prevent false positives and negatives.
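
As one possible illustration of "realistic-only" augmentation, this sketch uses torchvision (an assumed library choice) and deliberately omits vertical flips, since upside-down street scenes do not occur in practice.

```python
from torchvision import transforms

# Horizontal flips and mild photometric changes are plausible in street
# imagery; vertical flips are excluded on purpose. For detection tasks the
# geometric transforms must also be applied to the bounding boxes (not shown).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])
```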

Pry's Questions:
●Question: How do you determine what to blur when trying to protect privacy, considering that
blurring the whole face might be excessive?
●Answer: Jit did not directly address this question. However, Prasa and he agreed that blurring
specific features like eyes and noses could suffice for de-identification while reducing processing
demands.
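
A minimal sketch of that kind of feature-level blurring with OpenCV; the box coordinates and kernel size are illustrative.

```python
import cv2

def blur_region(image, box, ksize=(51, 51)):
    # Blur only the pixels inside the detected box, not the whole image.
    x1, y1, x2, y2 = box
    image[y1:y2, x1:x2] = cv2.GaussianBlur(image[y1:y2, x1:x2], ksize, 0)
    return image

frame = cv2.imread("street_view.jpg")            # hypothetical input image
frame = blur_region(frame, (120, 80, 180, 140))  # e.g. an eye/nose region
```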
●Question: How does the system differentiate overlapping objects, like a person in a vehicle,
when applying different blurring to each?
●Answer: Jit didn't explicitly answer this. However, his presentation explained that the system
uses bounding boxes and assigns object classes (person, car, etc.) to each detected region. This
suggests that the model can distinguish and apply different blurring to overlapping objects based
on their classifications.

Single-stage object detection models differentiate overlapping objects by:

1. Bounding Box Regression: Predicting separate bounding boxes for each object in the
overlap region.
2. Object Classification: Assigning class probabilities (e.g., person, vehicle) to each
predicted bounding box.
3. Non-Maximum Suppression (NMS): Ensuring that overlapping boxes with lower
confidence scores are suppressed, retaining the most confident predictions for distinct
objects.
4. Feature Maps: Utilizing spatial and contextual features from the image to accurately
separate and classify objects even in overlapping scenarios.

●Question: Since Google Street View is a live view, how does the system process real-time
images and select frames for blurring?
●Answer: Jit clarified that Google captures panoramic images and stitches them together to
create a route view. This process suggests the blurring happens on static images rather than live
video streams. Pry and Prasa discussed the complexities of handling live video streams,
proposing frame rate reduction and selective image processing as potential solutions.
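
A minimal sketch of that frame-rate-reduction idea; `detect` and `blur_boxes` are hypothetical placeholders for the detection and blurring stages.

```python
# Run the expensive detector only on every Nth frame and reuse its boxes
# for the frames in between.
def process_stream(frames, detect, blur_boxes, every_n=10):
    boxes = []
    for i, frame in enumerate(frames):
        if i % every_n == 0:      # refresh detections periodically
            boxes = detect(frame)
        yield blur_boxes(frame, boxes)
```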

Prasa's Questions and Suggestions:


●Question/Suggestion: Can we discuss various object detection models, including YOLO (You
Only Look Once) and SSD (Single Shot Detector), and compare their strengths and
weaknesses?
●Answer/Response: Jit briefly mentioned YOLO and SSD as examples of one-stage object
detection networks. Prasa emphasized the need to delve deeper into these models, including
RCNN variations (Fast RCNN and Faster RCNN), and compare their architectures and
performance characteristics.
●Question/Suggestion: Can we discuss the differences between commercial models, open-
source models, and models described in research papers?
●Answer/Response: This question wasn't answered directly. However, Prasa suggested
incorporating this comparison into future discussions to provide a broader perspective on the
landscape of object detection models.
●Question/Suggestion: Should we discuss ImageNet and COCO image databases, which are
widely used in the computer vision community?
●Answer/Response: This question was not addressed. However, Prasa's suggestion highlights
the value of exploring publicly available datasets to understand how models are trained and
evaluated.
●Question/Suggestion: Can we discuss the mathematical aspects of image processing, focusing
on techniques like data pre-processing and augmentation, and the libraries that support them?
●Answer/Response: While the presentation touched upon pre-processing techniques like
resizing and normalization, it didn't delve into the mathematical details. Prasa advocated
exploring the mathematical underpinnings of these techniques and discussing libraries that
facilitate their implementation. He proposed exploring topics such as wavelet transforms,
contour analysis, and relevant libraries to enhance the understanding of image processing
techniques.
