ML Study Design - Google Street View Blurring System
Objective: design an ML system that automatically detects privacy-sensitive regions (faces, people, vehicles) in Street View imagery and blurs them to protect privacy.
What is an RPN?
A Region Proposal Network (RPN) is a critical component in object detection systems like
Faster R-CNN. Its role is to generate proposals—regions in an image that are likely to contain
objects—quickly and accurately.
How it works:
A small convolutional network slides over the backbone's feature map; at each location it scores a set of anchor boxes (multiple scales and aspect ratios) for objectness and regresses offsets that refine those anchors into proposals.
The RPN enables end-to-end training, allowing the model to jointly optimize region proposals and object classification.
It is computationally efficient because it operates directly on the shared feature map, without a separate sliding-window or region-extraction step over the raw image.
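As a rough illustration (not part of the original notes), a minimal RPN head in PyTorch might look like the sketch below; the channel count and number of anchors are assumed values, and a real implementation adds anchor generation, proposal decoding, and NMS on top of this.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head sketch: a 3x3 conv slides over the backbone feature map;
    1x1 convs then predict, for each of k anchors per location, one objectness
    score and 4 box-regression offsets."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.objectness(x), self.bbox_deltas(x)

# Example: a single 256-channel feature map of spatial size 50x50.
features = torch.randn(1, 256, 50, 50)
scores, deltas = RPNHead()(features)
print(scores.shape, deltas.shape)  # torch.Size([1, 9, 50, 50]) torch.Size([1, 36, 50, 50])
```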
YOLO
YOLO (You Only Look Once) is a single-stage object detection system that performs detection
directly on a grid overlaid on the input image. Here's a concise breakdown:
1. Grid Division: The input image is divided into an S x S grid (e.g., 7x7).
2. Cell Predictions: Each grid cell is responsible for predicting:
o B bounding boxes: Each bounding box has:
(x, y): Center coordinates relative to the cell.
(w, h): Width and height relative to the image.
Confidence score: Probability of an object being in the box and the box
being accurate.
o C class probabilities: A probability distribution over the C object classes.
3. Encoding: These predictions are encoded into a tensor of size S x S x [B * 5 + C], where
5 represents the 4 bounding box coordinates + the confidence score (e.g., 7 x 7 x 30 for S = 7, B = 2, C = 20).
4. Non-Max Suppression (NMS): Because multiple cells might detect the same object,
NMS filters out redundant bounding boxes: the highest-confidence box is kept, and any
other box whose IoU (Intersection over Union) with it exceeds a threshold is suppressed
(see the sketch after this list).
5. Loss Function: YOLO uses a loss function that combines:
o Bounding box regression loss (how well the predicted boxes match the ground
truth).
o Confidence loss (how accurate the objectness predictions are).
o Classification loss (how accurate the class predictions are).
No Region Proposals: YOLO directly predicts bounding boxes and class probabilities
without a separate region proposal step. This is what makes it much faster.
Grid-Based Detection: Detection happens at the grid cell level. Each cell is responsible
for predicting objects whose centers fall within it.
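As an illustration of the NMS step referenced above (not from the presentation), a minimal greedy NMS in Python could look like this; the [x1, y1, x2, y2] box format and the 0.5 IoU threshold are assumptions:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the second box overlaps the first and is suppressed
```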
Some other models
Example of per-class evaluation and mAP for a two-stage detector:
Two classes ("cat", "dog") and two images are used for a simplified example.
Image 1: 2 cats, 1 dog (ground truth). Predictions: 3 cat predictions (2 correct), 1 correct
dog prediction.
Image 2: 1 cat, 2 dogs (ground truth). Predictions: 1 correct cat prediction, 3 dog
predictions (2 correct).
Cat: Image 1: P = 2/3 ≈ 0.67, R = 1.0. Image 2: P = 1.0, R = 1.0. AP (simplified average of per-image precisions): ≈ 0.835
Dog: Image 1: P = 1.0, R = 1.0. Image 2: P = 2/3 ≈ 0.67, R = 1.0. AP (simplified average of per-image precisions): ≈ 0.835
mAP (mean of the per-class APs): ≈ 0.835. (A full evaluation would instead integrate precision over recall at one or more IoU thresholds; the averaging here is a simplification.)
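A small Python sketch of this simplified calculation follows; the counts are just the numbers from the example above, and this per-image averaging is a stand-in for real AP, which integrates a precision-recall curve.

```python
# (true positives, predictions, ground-truth boxes) per image, per class,
# taken from the two-image example above.
counts = {
    "cat": [(2, 3, 2), (1, 1, 1)],   # Image 1, Image 2
    "dog": [(1, 1, 1), (2, 3, 2)],
}

aps = {}
for cls, per_image in counts.items():
    precisions = []
    for tp, num_pred, num_gt in per_image:
        precision = tp / num_pred
        recall = tp / num_gt
        precisions.append(precision)
        print(f"{cls}: P={precision:.2f}, R={recall:.2f}")
    aps[cls] = sum(precisions) / len(precisions)  # simplified per-class AP

map_score = sum(aps.values()) / len(aps)          # mean over classes
print(f"AP per class: {aps}")
print(f"mAP = {map_score:.3f}")  # ~0.833 exactly; the notes round 2/3 to 0.67, giving 0.835
```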
Pry's Questions:
●Question: How do you determine what to blur when trying to protect privacy, considering that
blurring the whole face might be excessive?
●Answer: Jit did not directly address this question. However, Prasa and he agreed that blurring
specific features like eyes and noses could suffice for de-identification while reducing processing
demands.
●Question: How does the system differentiate overlapping objects, like a person in a vehicle,
when applying different blurring to each?
●Answer: Jit didn't explicitly answer this. However, his presentation explained that the system
uses bounding boxes and assigns object classes (person, car, etc.) to each detected region, which
suggests the model can distinguish overlapping objects and apply different blurring to each based
on its class. The detector components that make this possible are:
1. Bounding Box Regression: Predicting separate bounding boxes for each object in the
overlap region.
2. Object Classification: Assigning class probabilities (e.g., person, vehicle) to each
predicted bounding box.
3. Non-Maximum Suppression (NMS): Ensuring that overlapping boxes with lower
confidence scores are suppressed, retaining the most confident predictions for distinct
objects.
4. Feature Maps: Utilizing spatial and contextual features from the image to accurately
separate and classify objects even in overlapping scenarios.
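As a concrete but hypothetical illustration of applying different blurring per detected class, here is a small OpenCV sketch; the class names, kernel sizes, and the blur_detections helper are illustrative assumptions, not part of the notes or of Google's pipeline:

```python
import cv2
import numpy as np

# Hypothetical per-class blur strengths (Gaussian kernel sizes must be odd).
BLUR_KERNELS = {"person": (51, 51), "vehicle": (25, 25)}

def blur_detections(image, detections, min_confidence=0.5):
    """Apply a class-dependent Gaussian blur to each detected box.
    detections: list of (class_name, confidence, (x1, y1, x2, y2)).
    Overlapping boxes are blurred independently, so a person sitting
    inside a vehicle box receives both treatments."""
    out = image.copy()
    for class_name, confidence, (x1, y1, x2, y2) in detections:
        if confidence < min_confidence or class_name not in BLUR_KERNELS:
            continue
        roi = out[y1:y2, x1:x2]
        out[y1:y2, x1:x2] = cv2.GaussianBlur(roi, BLUR_KERNELS[class_name], 0)
    return out

# Usage on a dummy image: one "person" box nested inside a "vehicle" box.
frame = np.zeros((400, 600, 3), dtype=np.uint8)
dets = [("vehicle", 0.9, (100, 150, 400, 350)), ("person", 0.8, (180, 180, 260, 330))]
blurred = blur_detections(frame, dets)
```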
●Question: Since Google Street View is a live view, how does the system process real-time
images and select frames for blurring?
●Answer: Jit clarified that Google captures panoramic images and stitches them together to
create a route view. This process suggests the blurring happens on static images rather than live
video streams. Pry and Prasa discussed the complexities of handling live video streams,
proposing frame rate reduction and selective image processing as potential solutions.
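If one did need to handle a live stream, the frame-rate-reduction idea could be sketched as below; the sampling interval and the detect_and_blur placeholder are assumptions for illustration, not part of the discussion:

```python
import cv2

SAMPLE_EVERY = 5  # run the expensive detection/blurring step on every 5th frame

def detect_and_blur(frame):
    """Placeholder for the detection + blurring pipeline discussed above."""
    return frame  # a real system would return the frame with sensitive regions blurred

def process_stream(source=0):
    cap = cv2.VideoCapture(source)
    last_output, frame_index = None, 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if frame_index % SAMPLE_EVERY == 0:
            last_output = detect_and_blur(frame)  # subsampled: "frame rate reduction"
        # for skipped frames, last_output (the most recent blurred frame) could be reused
        frame_index += 1
    cap.release()
    return last_output
```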