
A41174 - Vision AI When Data Is Expensive and Constantly Changing

The document discusses challenges in developing real-world computer vision applications when data is expensive and constantly changing. It introduces Metropolis, a set of microservices and reference applications to help build complex vision AI solutions. As an example, a self-checkout application is described that uses microservices for feature extraction, similarity search, and embedding to allow kiosks to visually confirm products with limited labeled data. The application architecture and workflow demonstrate how to adapt models during deployment.


Vision AI when Data is Expensive and Constantly Changing

Khoa Ho, Product Manager
Ethem Can, Deep Learning Engineer
Real-World Vision AI is Complicated
It’s not just tackling a benchmark

• Typical benchmark: 1K classes, static, common objects
• Industry problems (retail, …): several thousand to 100K classes, constantly changing, unique objects
E2E Vision App is Complex
It’s not just a model

• Accurate and efficient AI model
• Real-time perception pipeline
• Full app with complex business logic

And that’s just the development. What about deployment and maintenance?
Complex Apps Need Better Building Blocks
Introducing Metropolis Microservices & Reference Apps

The stack, from top to bottom:
• Customer Vision AI Solutions & Apps: Intelligent Traffic, Multi-Camera Tracking, Smart Self-Checkout, Industrial Inspection, Robot-Human Interaction
• Metropolis Microservices: Detection & Tracking, Behavior Analytics, Similarity Search, Multi-Camera Tracking, Sensor Fusion, …
• SDKs & tools: Pre-Trained Models, TAO Toolkit, DeepStream SDK, Video & Storage Toolkit
• CUDA-X: Triton, TensorRT, RAPIDS, DALI, VPI
• Runs from edge to cloud
Self-Checkout AI Copilot
A problem statement / a reference application

• Goal: allow existing kiosks to visually confirm products

• Requirements:
• A typical kiosk
• Camera(s)
• An AI system that can adapt during deployment, using limited data for each class and often without retraining
• A semi-automated labeled-data acquisition process

(Diagram: camera(s) and a barcode scanner feed Visual Perception, followed by Evaluation & Correction and Monitoring & Alert.)
App Architecture & Workflow
Typical scenarios

(Architecture: camera frames pass through Feature Extraction; Similarity Search matches the resulting embeddings against the Embedding Database to produce a visual prediction; the Evaluator compares that prediction with the barcode signal from the scanner and drives Database Ops and Model Fine-tune/Retrain.)

• Visual prediction & barcode signal agree: continue.

• Visual prediction & barcode signal disagree, and the visual system has:
• Low confidence: add the item’s embedding to the database.
• High confidence: perform store-defined alerts/actions.

• Enough new data is collected: retrain/fine-tune models.
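The decision logic in these scenarios can be sketched as follows. This is a minimal illustration of the described behavior; the names (`evaluate_scan`, `Decision`) and the 0.8 confidence threshold are assumptions, not part of the actual microservices.

```python
from enum import Enum

class Decision(Enum):
    CONTINUE = "continue"            # visual prediction and barcode agree
    ADD_EMBEDDING = "add_embedding"  # disagreement at low confidence: grow the DB
    ALERT = "alert"                  # disagreement at high confidence: store-defined action

def evaluate_scan(visual_pred: str, barcode_id: str,
                  confidence: float, threshold: float = 0.8) -> Decision:
    """Compare the visual prediction against the barcode signal."""
    if visual_pred == barcode_id:
        return Decision.CONTINUE
    if confidence < threshold:
        return Decision.ADD_EMBEDDING
    return Decision.ALERT
```

For example, a mismatch at confidence 0.5 would add the item's embedding to the database, while the same mismatch at 0.95 would trigger a store-defined alert.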


App Workflow
Demo
Feature Extraction Pipeline
Designed for real-time processing

(Architecture diagram repeated, with Feature Extraction highlighted; the barcode scanner supplies the barcode ID.)

DeepStream app: Decode → Stream Mux → Preprocess → Localize Object → Crop → Generate Embedding

GPU      | Model precision | # of streams (1080p @ 30 fps)
AGX Orin | FP16            | 5
Object Localization Model

Current:
• A binary detector (object vs. non-object), i.e., an object localizer
• EfficientDet-D5

Next steps:
• Explore turning it into a meta-class detector to scale
• E.g., 100 meta-classes of 1K products each = 100K products

(Figures: original frame vs. detected frame, including a “Dairy” meta-class example.)

Source: EfficientDet: Scalable and Efficient Object Detection (2020)

Embedding Model

• Creates an embedding from the cropped frame

• ResNet101 (pre-trained) + MLP
• Fine-tuned with triplet loss: learning pulls the anchor toward the positive and pushes the negative at least a margin away

(Figure: anchor/positive/negative triplets before and after learning; the model maps a crop to an embedding vector, e.g., [42 17 36 … 90].)
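A minimal numpy sketch of the triplet loss objective: the loss is zero once the anchor is closer to the positive than to the negative by at least the margin. The function name and the 0.2 margin are illustrative assumptions, not the actual training code.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Euclidean distances from the anchor to the positive and negative samples
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    # Penalize only when the negative is not at least `margin` farther away
    return max(d_pos - d_neg + margin, 0.0)
```

During fine-tuning, this loss (averaged over mined triplets) shapes the embedding space so that crops of the same product cluster together.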
Similarity Search
Accurate & fast image retrieval

(Architecture diagram repeated, with Similarity Search and the Embedding Database highlighted.)
Similarity Search
Accurate & fast image retrieval

• Vector database for high-throughput similarity search
• Search via k-nearest neighbor; k = 1 works well in our experiments so far

(Figure: Embedding Generation maps images into a reference space and a query space; blue points are seen classes, red points are unseen classes.)

Source: adapted from https://milvus.io/blog/scalable-and-blazing-fast-similarity-search-with-milvus-vector-database.md
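The k-nearest-neighbor retrieval can be sketched with plain numpy standing in for a vector database such as Milvus. The function and variable names are illustrative; with k = 1, as used in the slides, classification reduces to taking the label of the single closest reference embedding.

```python
import numpy as np

def knn_classify(query, db_embeddings, db_labels, k=1):
    # Euclidean distance from the query to every reference embedding
    dists = np.linalg.norm(db_embeddings - query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels = [db_labels[i] for i in nearest]
    # Majority vote over the k neighbors (trivial when k = 1)
    return max(set(labels), key=labels.count), dists[nearest[0]]
```

The returned distance is what the Evaluator can compare against a user-defined threshold to decide whether a match is low- or high-confidence.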
Retrieval-Based Classification
Embedding Model + Similarity Search

Tested on an AliProducts subset:
• 3K classes trained, 1K classes unseen
• With 100 images/class as references in the embedding DB, classification accuracy:
• 88% (seen)
• 84% (unseen)

Real-life problems are even more challenging:
• Have to detect/localize items (the AliProducts challenge is classification-only)
• Highly diverse & dynamic backgrounds
• Any possible item orientation (not just frontal)
• Varied distance from the camera
• Runtime performance is also critical (so no huge models and no ensembling as in the competition)

Source: https://retailvisionworkshop.github.io/recognition_challenge_2020/
Combined Localization & Classification
Feature Extraction Pipeline + Similarity Search

Tested on an internal self-checkout dataset:
• 112 retail products (12 unseen)
• 3 persons scanning
• Trained on 100-300 images/class for 100 seen classes
• Tested on unseen classes

(Chart: classification accuracy on unseen classes vs. number of embeddings per product in the database, from 0 to 100, comparing classification-only against localization+classification; the labeled points at 10 embeddings per product read 0.90 and 0.79.)

Synthetic Data Augmentation
Generate with Omniverse Replicator

• Augment existing real datasets to improve dataset size & diversity

• Multiple variables & strategies in generating synthetic data
• While the sim2real gap is open research, combining all of the below works well in our experiments so far

Strategies, ordered along the data diversity → data fidelity spectrum:
• Abstract backgrounds with flying distractors
• Real NVImageNet images as backgrounds, with variance in bbox locations
• Domain randomizations: background texture, camera view, product orientation
• Real self-checkout scenes as backgrounds
• Photorealistic, physically based 3D environments: more natural appearance and more natural object-scene interaction
Synthetic Data Augmentation
Improve accuracy & generalizability of the application

• Better transferability to a new environment, especially on unseen products
• Evaluate product recognition accuracy when trained with real data only vs. real & synthetic data

• For example, in a new test scenario:
• Same 112 products (12 unseen)
• 2 different persons
• Slightly different camera view
• Same models & system architecture
• 100 images/class as references in the embedding DB

Train Data Sources | Dataset Size: Real | Dataset Size: Synthetic | Test Accuracy: Seen | Test Accuracy: Unseen
Real               | 50K                | -                       | 79%                 | 14%
Real + Synthetic   | 50K                | 100K                    | 94%                 | 70%
Evaluator
Many metrics for optimal operation

(Architecture diagram repeated, with the Evaluator highlighted.)
Evaluator
Many metrics for optimal operation

Per-prediction metrics:
• Similarity search distance (vs. a user-defined threshold)
• Similarity search consistency across frames
• Bounding box prediction confidence

Per-class metrics:
• Number of embeddings in DB
• Diversity of embeddings (variance, etc.)
• Percentage of mismatches or low-confidence matches

Overall app metrics:
• Moving average of:
• Top-k similarity search distance
• Bounding box prediction confidence
• Percentage of mismatches or low-confidence matches
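One common way to track the "moving average" app metrics above is an exponential moving average per metric. This is a generic sketch under that assumption; the class name and smoothing factor are illustrative, not taken from the Evaluator's implementation.

```python
class MovingAverage:
    """Exponential moving average of a streaming metric."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha   # weight of the newest observation
        self.value = None    # no observations yet

    def update(self, x):
        # First sample initializes the average; later samples blend in
        self.value = x if self.value is None else \
            (1 - self.alpha) * self.value + self.alpha * x
        return self.value
```

One instance per metric (top-k distance, bbox confidence, mismatch rate) gives the Evaluator a smoothed signal to compare against its thresholds.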
Deployment Architecture
Composed of Metropolis Microservices

• Visual perception: camera(s) stream over RTSP into Video Management & Storage; Feature Extraction, Similarity Search with the Embedding DB, and the Evaluator exchange data through the Redis Message Broker; the barcode scanner (sensor monitoring) connects over HTTP.
• Evaluation & correction and monitoring: Logstash + Elasticsearch feed a Kibana Dashboard; Web APIs and a Web UI expose the system.
• Everything runs on compute, media, and middleware layers.
Reference Deployment
Optimized for latency & efficiency

• Perception microservices are distributed

• The update engine and other monitoring ops are centralized

(Diagram: each edge device runs Sensor(s) → Perception; the Embedding DB and Evaluation & Correction run on an edge server or in the cloud; a customer or an associate interacts at the edge device, while an operator or data scientist works at the server.)
Reference Deployment
Designed for scaling

• Multiple stores can synchronize their embedding databases*

(Diagram: per-store embedding DBs synchronize with an on-prem or cloud embedding DB, overseen by an operator or data scientist.)

* Roadmap
Reference Deployment
As a data acquisition platform

• In one scan, create labeled data for multiple camera views

• Data foundation for other smart store projects: the smart self-checkout can supply data or pre-trained models to, e.g., cashierless checkout
Customizability
Multiple options at all levels

Level        | Options
Scene        | Add sensors; adjust operation parameters (when the app should learn & when to act)
Model        | Fine-tune the models; use your own models
App          | Modify the application graph; integrate into your own app
Microservice | Modify the microservice code

No-Code Model Training: TAO Toolkit
No-Code App Development: UCF Studio

UCF Studio Demo
App Workflow Demo
APIs for App Integration
More are coming

API                       | Function
start_fsl                 | starts the system
start_services            | starts the Milvus, Redis, and ELK services
stop_services             | stops the Milvus, Redis, and ELK services
add_client_metadata       | starts the service that adds client metadata
similarity_search         | starts the similarity search service
pulse_check               | checks if the Milvus, Redis, and ELK services are running
compile_unseen_images     | gathers images and labels of unseen items flagged by the system, to be used for improving similarity search or fine-tuning the embedding model
analyze_matches           | generates an analysis report of object matches
similarity_search_ops     | various operations for the similarity search DB
insert_new_item_to_sim_db | generates embeddings of input images and inserts them into the embedding database
generate_synthetic_data*  | generates synthetic data
finetune*                 | fine-tunes an existing model with new data
deploy_model*             | replaces an existing model with a new one in the DeepStream app

* Roadmap
Looking Forward
Features & problems we’re working on

Accuracy:
• More synthetic data to increase generalizability
• Combine multiple frame predictions for more robustness
• Predict from multiple views
• Transformer-based models
• Meta-class object detector

Performance:
• Model pruning & INT8 quantization
• GPU-accelerated similarity search

Usability:
• More metrics for better evaluation and correction
• Auto-correction mode
• REST APIs

Summary

• Accelerate app development with Metropolis Microservices
• Self-Checkout AI Copilot: a starting point for smart retail
• Deploy from edge to cloud with cloud-native technology

Coming soon: December 2022

Early sign-up: developer.nvidia.com/metropolis-microservices
