A41174 - Vision AI When Data Is Expensive and Constantly Changing
Industry problems
(Retail, …)
Several Ks to 100K
Constantly changing
Unique
1k classes
Static
Common
E2E Vision App is Complex
It’s not just a model
Metropolis
Intelligent Traffic, Multi-Camera Tracking, Smart Self-Checkout, Industrial Inspection, Robot-Human Interaction
Microservices
Detection & Tracking, Behavior Analytics, Similarity Search
Pre-Trained Models, TAO Toolkit, DeepStream SDK, Video & Storage Toolkit
CUDA-X
Triton, TensorRT, RAPIDS, DALI, VPI
Edge, Cloud
Self-Checkout AI Copilot
A problem statement / A reference application
Diagram components: Camera(s), Barcode Scanner, Visual Perception, Evaluation & Correction, Monitoring & Alert
App Architecture & Workflow
Typical scenarios
Diagram 1 components: Camera(s), Barcode Scanner, Barcode signal, Embedding Database, Model Fine-tune / Retrain
Diagram 2 components: Camera(s), Barcode Scanner, Barcode ID, Embedding Database, Model Fine-tune / Retrain
Table columns: GPU, Model precision, # of streams (1080p@30 fps)
DeepStream app
Object Localization Model
Embedding Model
ResNet101 (Pre-trained) + MLP → embedding vector [42 17 36 … 90]
Learning: Anchor, Positive, Negative, Margin
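The anchor, positive, negative, and margin labels on this slide are the ingredients of a triplet-style objective. The sketch below assumes that is the loss in use; the Euclidean distance, the margin value of 0.2, and the function names are illustrative assumptions, not details from the talk.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (margin value is an assumption).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = l2_distance(anchor, positive)
    d_neg = l2_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings; real embeddings would come from the embedding model.
anchor = [0.0, 0.0]
positive = [0.0, 0.1]        # same product as the anchor
easy_negative = [1.0, 0.0]   # different product, already far away
hard_negative = [0.0, 0.2]   # different product, too close to the anchor

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Only the hard negative contributes a non-zero loss here, which is the case that pushes the embedding space apart during training.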
Similarity Search
Accurate & fast image retrieval
• Vector database for high-throughput similarity search
• Search via k-nearest neighbor
• k = 1 works well in our experiments so far
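As a rough illustration of k-nearest-neighbor retrieval with k = 1, the sketch below runs a brute-force search over a small in-memory list. It stands in for the vector database rather than reproducing it; the cosine similarity metric, the labels, and the toy vectors are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_search(query, database, k=1):
    """Return the k (label, similarity) pairs most similar to `query`.

    `database` is a list of (label, embedding) reference entries.
    """
    scored = [(label, cosine_similarity(query, emb)) for label, emb in database]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical reference entries; a real deployment would hold many per class.
reference_db = [
    ("cereal_box", [0.9, 0.1, 0.0]),
    ("soda_can", [0.1, 0.9, 0.1]),
    ("soap_bar", [0.0, 0.2, 0.9]),
]
query_embedding = [0.8, 0.2, 0.1]

top1 = knn_search(query_embedding, reference_db, k=1)
predicted_label = top1[0][0]
```

A brute-force scan is linear in the number of references, which is why a dedicated vector database is preferable at high throughput.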
Embedding Generation
Tested on AliProducts subset:
• 3K classes trained
• 1K classes unseen
• With 100 images/class as references in embedding DB, classification accuracy:
  • 88% (seen)
  • 84% (unseen)
Source: https://retailvisionworkshop.github.io/recognition_challenge_2020/
Real life problems are even more challenging:
• Have to detect/localize items (AliProducts challenge is classification-only)
• Highly diverse & dynamic backgrounds
• Any possible item orientation (not just frontal)
• Varied distance from the camera
• Runtime performance is also critical (so, no huge model, no ensembling like in competition)
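The seen/unseen accuracy split reported for the AliProducts subset can be computed along these lines. This is a minimal sketch: the record format, the function name, and the toy labels are assumptions, and the numbers it produces are not the results from the talk.

```python
def split_accuracy(records, seen_classes):
    """Compute accuracy separately for seen and unseen classes.

    records: iterable of (true_label, predicted_label) pairs.
    seen_classes: set of labels the embedding model was trained on.
    Returns {"seen": acc or None, "unseen": acc or None}.
    """
    counts = {"seen": [0, 0], "unseen": [0, 0]}  # [correct, total]
    for true_label, predicted in records:
        group = "seen" if true_label in seen_classes else "unseen"
        counts[group][1] += 1
        if predicted == true_label:
            counts[group][0] += 1
    return {g: (c / t if t else None) for g, (c, t) in counts.items()}

# Toy predictions: "new_gum" was never used for training, only enrolled
# as reference images in the embedding DB.
records = [
    ("cola", "cola"), ("cola", "cola"),
    ("chips", "cola"), ("chips", "chips"),
    ("new_gum", "new_gum"), ("new_gum", "chips"),
]
accuracy = split_accuracy(records, seen_classes={"cola", "chips"})
```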
Combined Localization & Classification
Feature Extraction Pipeline + Similarity Search
• 112 retail products (12 unseen)
• 3 persons scanning
• Trained on 100-300 images/class for 100 seen classes
Figure: plot with series "Classification-only" and "Localization+Classification"; y-axis 0 to 1, x-axis 0 to 100; data labels at x = 10: 0.90 and 0.79
Data Diversity
Data Fidelity
Synthetic Data Augmentation
Improve accuracy & generalizability of application
• Better transferability to new environment
  • Especially on unseen products
• Evaluate product recognition accuracy, when trained with:
  • Real data-only
  • Real & synthetic data
• For example, new test scenario:

Train Data Sources   Dataset Size (Real)   Dataset Size (Synthetic)   Test Accuracy (Seen)   Test Accuracy (Unseen)
Real                 50K                   -                          79%                    14%
Real + Synthetic     50K                   100K                       94%                    70%
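To make the table easier to read, the short calculation below derives the accuracy gain in percentage points and the share of synthetic samples in the combined training set. The figures are taken from the table; the variable names are illustrative.

```python
# Test accuracy (%) from the table, keyed by training data source.
results = {
    "Real": {"seen": 79, "unseen": 14},
    "Real + Synthetic": {"seen": 94, "unseen": 70},
}

# Gain from adding synthetic data, in percentage points.
gains = {
    split: results["Real + Synthetic"][split] - results["Real"][split]
    for split in ("seen", "unseen")
}

# Fraction of the combined training set that is synthetic (50K real + 100K synthetic).
real_size, synthetic_size = 50_000, 100_000
synthetic_share = synthetic_size / (real_size + synthetic_size)
```

The gain on unseen products (56 points) is much larger than on seen products (15 points), with synthetic data making up two thirds of the combined set.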
Evaluator
Many metrics for optimal operation
Diagram components: Camera(s), Barcode Scanner, Visual Perception, Evaluation & Correction, Sensor Monitoring, Video Mgmt. & Storage, Redis Message Broker, Embedding DB, Kibana Dashboard, Web UI; protocols: RTSP, HTTP
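One way to picture the evaluation step is as a comparison between the barcode ID and the top-1 visual match for each scan. The sketch below is a guess at that logic under stated assumptions: the `ScanEvent` fields, the `min_similarity` threshold of 0.8, and the three outcome names are hypothetical and do not come from the reference application.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ScanEvent:
    barcode_id: str     # ID reported by the barcode scanner
    predicted_id: str   # top-1 ID returned by similarity search
    similarity: float   # similarity score of that top-1 match

def evaluate(event: ScanEvent, min_similarity: float = 0.8) -> str:
    """Classify one scan as 'match', 'mismatch', or 'unseen'."""
    if event.similarity < min_similarity:
        return "unseen"    # no confident visual match: candidate new item
    if event.predicted_id == event.barcode_id:
        return "match"
    return "mismatch"      # confident visual match disagrees with the barcode

events = [
    ScanEvent("0001", "0001", 0.95),
    ScanEvent("0002", "0003", 0.91),
    ScanEvent("0004", "0001", 0.42),
    ScanEvent("0005", "0005", 0.88),
]
summary = Counter(evaluate(e) for e in events)
match_rate = summary["match"] / len(events)
```

Counts like these are the kind of metric that could be published to a message broker and charted on a dashboard, with "unseen" events feeding the data collected for later fine-tuning.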
Reference Deployment
Designed for scaling
Diagram components: Edge Devices (Sensor(s) → Perception) with Embedding DBs; Embedding DBs on-prem or in the cloud; DB Synchronization*; Operator or data scientist
* Roadmap
Reference Deployment
As a data acquisition platform
Smart self-checkout
Cashierless checkout
Customizability
Multiple options at all levels
Level    Options
Scene    Add sensors
         Adjust operation parameters (when the app should learn & when to act)
API                          Function
start_fsl                    starts the system
start_services               starts the Milvus, Redis, and ELK services
stop_services                stops the Milvus, Redis, and ELK services
add_client_metadata          starts the service that adds client metadata
similarity_search            starts similarity search service
pulse_check                  checks if the Milvus, Redis, and ELK services are running
compile_unseen_images        gathers image and label of unseen items flagged by the system, to be used for improving similarity search or finetuning embedding model
analyze_matches              generates an analysis report of object matches
similarity_search_ops        various operations for the similarity search DB
insert_new_item_to_sim_db    generates embeddings of input images and inserts them into the embedding database
generate_synthetic_data*     generates synthetic data
finetune*                    finetunes an existing model with new data
deploy_model*                replaces an existing model with a new one in the DeepStream app
* Roadmap
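The table describes inserting a new item as generating embeddings for input images and adding them to the embedding database. The sketch below mimics that idea with an in-memory dictionary. It is not the signature or behavior of `insert_new_item_to_sim_db`: the names `enroll_item` and `toy_embed`, and the normalization used as a stand-in embedding, are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Embedding = List[float]

def toy_embed(image: Sequence[float]) -> Embedding:
    """Stand-in for the embedding model: L2-normalizes the raw values."""
    norm = math.sqrt(sum(v * v for v in image))
    return [v / norm for v in image] if norm else list(image)

def enroll_item(
    db: Dict[str, List[Embedding]],
    label: str,
    images: Sequence[Sequence[float]],
    embed: Callable[[Sequence[float]], Embedding] = toy_embed,
) -> int:
    """Embed each image and store it under `label`.

    Returns the number of reference embeddings now held for that label.
    """
    refs = db.setdefault(label, [])
    refs.extend(embed(img) for img in images)
    return len(refs)

# Enroll a new product from two reference "images" (toy 2-value vectors).
embedding_db: Dict[str, List[Embedding]] = {}
count = enroll_item(embedding_db, "new_snack", [[3.0, 4.0], [6.0, 8.0]])
```

Because recognition is done by similarity search, adding references in this way makes a new product searchable without retraining the embedding model.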
Looking Forward
Features & problems we’re working on
• More synthetic data to increase generalizability
• Model pruning & INT8 quantization
• More metrics for better evaluation and correction
• Transformer-based models
• Accelerate App Development with Metropolis Microservices
• Self-Checkout AI Copilot: Starting point for smart retail
• Deploy from edge to cloud with cloud-native technology