ML Monitoring is not APM
Cory A. Johannsen
Product Engineer, Verta Inc.
www.verta.ai
Agenda
▴ What is APM?
▴ What is ML monitoring?
▴ How ML monitoring and APM differ
▴ The unique needs of ML monitoring
▴ A very cool solution to model monitoring from Verta
About
https://www.verta.ai/product
- End-to-end MLOps platform for ML model delivery, operations, and management
- Kubernetes-based operations stack for ML
- 23 years as a software engineer
- Embedded systems, enterprise software, SaaS
- 6 years in APM working at scale
What is APM?
▴ Application Performance Monitoring
▴ Metrics
○ Name
○ Value
○ Labels
○ Timestamp
▴ Visualization
▴ Alerting
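The four metric fields listed above can be pictured as a small record type. This is a minimal sketch, not any particular APM product's data model; the `Metric` class, the `error_rate` helper, the metric name, and the label values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
import time


@dataclass
class Metric:
    """One APM data point: a named value with labels and a timestamp."""
    name: str
    value: float
    labels: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


# Hypothetical example values, not taken from the talk.
sample = Metric(
    name="http_error_rate",
    value=error_rate(errors=3, requests=200),
    labels={"service": "checkout", "env": "prod"},
    timestamp=1_700_000_000.0,
)
```

Visualization and alerting then operate on streams of such records, typically filtered and grouped by label.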
What do I care about monitoring in APM?
▴ Health
▴ Availability
▴ Performance
▴ Stability
▴ Notification
APM in practice
▴ Production operations
▴ Diagnostics and debugging
▴ Critical incident response
What is Model Monitoring?
▴ Know when models are failing
▴ Quickly find the root cause
▴ Close the loop with fast recovery
Ensuring model results are
consistently of high quality
*We refer to latency, throughput, etc. collectively as model service health
Know when a model fails
▴ Without ground truth, model failures are challenging to detect
▴ Need to monitor complex statistical summaries
○ Distributions, anomalies, missing values, quantiles, etc.
▴ Often model-specific
Quickly find the root cause
▴ A model is one part of an inference pipeline
▴ Need a global view of the pipeline jungle to see where the root issue may be
Close the loop
▴ Intelligent detection and alerting to pre-emptively identify issues and trigger remediations
▴ Execute re-trains, fallback models, and human intervention
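To make "statistical summaries" concrete, here is a minimal sketch of a function that summarizes one feature column by its missing-value rate, mean, and quartiles. The function name `profile_column`, the use of `None` to mark missing values, and the sample data are assumptions for illustration; the talk does not prescribe specific statistics.

```python
import statistics


def profile_column(values):
    """Summarize one feature column: missing rate, mean, and quartiles.

    `values` is a list in which None marks a missing value.
    """
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing_rate": (len(values) - len(present)) / len(values) if values else 0.0,
    }
    # Quartiles need at least two observed values.
    if len(present) >= 2:
        q1, q2, q3 = statistics.quantiles(present, n=4)
        summary.update(
            {"mean": statistics.fmean(present), "q1": q1, "median": q2, "q3": q3}
        )
    return summary


ages = [34, 29, None, 41, 52, None, 38, 45]  # made-up data
summary = profile_column(ages)
```

Computing the same summary on training data and on live inputs gives two objects that can be compared even when no ground truth is available.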
How APM and ML monitoring align
▴ Error rate, Throughput, Latency
○ You need to know your production systems are operational
▴ Visualization
○ You need to see change over time
▴ Alerting
○ You need to know when
something has gone wrong
(and only when something
has gone wrong)
What do you care about in ML Monitoring?
▴ Distribution
○ Training versus test
○ Iteration over iteration
○ Live prediction
▴ Drift
○ Change in distribution over time
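One way to put a number on "change in distribution over time" is to bin a reference sample and a live sample over the same edges and compare the bin fractions. The sketch below uses the Population Stability Index (PSI), a technique swapped in here for illustration; the talk does not name a specific drift measure. The bin edges, the `eps` floor for empty bins, and the data are assumptions.

```python
import math


def histogram(values, edges):
    """Fraction of values in each [edges[i], edges[i+1]) bin.

    Values outside the edges are clamped into the first or last bin.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def psi(reference, live, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins."""
    p = histogram(reference, edges)
    q = histogram(live, edges)
    score = 0.0
    for pi, qi in zip(p, q):
        pi, qi = max(pi, eps), max(qi, eps)  # avoid log(0) for empty bins
        score += (qi - pi) * math.log(qi / pi)
    return score


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
training = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]      # made-up scores
shifted = [0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 0.95, 0.95, 0.99]    # made-up scores

no_drift = psi(training, training, edges)
drift = psi(training, shifted, edges)
```

Identical samples score zero, and the score grows as the live distribution moves away from the reference. What score counts as significant depends on the model and its tolerances.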
How APM and ML monitoring differ
▴ Error Rate, Throughput, Latency
○ Necessary, no longer sufficient
▴ Not all work is production work
○ ML monitoring happens from the beginning
of the pipeline
▴ APM can tell you what is wrong
○ ML monitoring is about understanding why
What makes ML monitoring unique
▴ Quantitative analysis of model performance
○ Information you can use
▴ Controlled comparison of distributions
○ Repeatable
○ Reliable
○ Consistent
▴ Alerting on meaningful deviation
○ Actionable
○ Timely
○ Accurate
Only you know the shape of your data
▴ Every model and pipeline is different and specialized
○ You built them, you understand them
▴ You know what metrics and distributions are valuable
○ This is your model, you know the data and processes that created it
▴ You know the expected distributions
○ You can determine whether the behavior is correct
Only you know how to measure change
▴ Compare to reference set
○ Training, test, golden data set
▴ Compare to a baseline
○ Calculate a baseline from your data or production systems
▴ Compare to something else
○ Use a comparison that makes sense in your domain
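As one illustration of "compare to a baseline", the sketch below computes a baseline (mean and sample standard deviation) from a reference set and reports how many baseline standard deviations a live batch's mean sits from it. The function names and data are hypothetical, and a mean/standard-deviation baseline is only one possible choice; other domains may call for a different comparison.

```python
import statistics


def build_baseline(reference):
    """Compute a baseline (mean, sample standard deviation) from reference values."""
    return {
        "mean": statistics.fmean(reference),
        "stdev": statistics.stdev(reference),
    }


def deviation(baseline, live):
    """Distance of the live mean from the baseline mean, in baseline stdevs."""
    if baseline["stdev"] == 0:
        raise ValueError("baseline has zero variance; choose another comparison")
    return abs(statistics.fmean(live) - baseline["mean"]) / baseline["stdev"]


golden = [10.0, 12.0, 11.0, 9.0, 13.0]  # made-up reference set
live = [15.0, 16.0, 14.0]               # made-up live batch

baseline = build_baseline(golden)
score = deviation(baseline, live)
```

Whether a given score matters is a tolerance that the model owner sets, not something the comparison decides on its own.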
Only you know when a change matters
▴ You know your model and tolerances
▴ You know when a deviation is significant (or not!)
▴ You know when these conditions need to change
Verta understands model monitoring
▴ Designed for your workflows
▴ Easy integration to capture your monitoring data
▴ Visualize and understand your metrics, distributions, and drift
▴ Get alerted when you should - not otherwise
Introducing a generalized
framework for Model Monitoring
Concepts
▴ Monitored Entity: A reference name (e.g. model or pipeline) that you want to
monitor
▴ Profiler: A function that computes statistics about your data
▴ Summary: A collection of statistics about your data (output of profiler)
○ Samples: instance of a summary, i.e., a statistic
○ Labels: key-values attached to summary samples. Used for rich filtering and
aggregation
▴ Alerter: Triggered periodically, it can talk with the Verta API to fetch information
about summaries and identify if they look wrong
How does it work?
1. Define monitored entity: the entity to be monitored (e.g., model, data, pipeline)
2. Define summaries to monitor for the entity
3. Run profilers (manually or automatically) to produce summary samples
4. View samples, define alerts
5. Get alerted (e.g. via Slack)
6. Close the loop!
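The steps above can be sketched in plain Python. This is not the Verta client API: every class and function here (`MonitoredEntity`, `Summary`, `SummarySample`, `run_profiler`, `alerter`) is a hypothetical stand-in that mirrors the concepts from the slides, and the notification is a plain callback rather than a real Slack integration.

```python
from dataclasses import dataclass, field
import statistics
import time


@dataclass
class SummarySample:
    """One computed statistic, with labels for filtering and aggregation."""
    value: float
    labels: dict
    timestamp: float


@dataclass
class Summary:
    """A named collection of samples produced by a profiler."""
    name: str
    samples: list = field(default_factory=list)


@dataclass
class MonitoredEntity:
    """A reference name (model, data, pipeline) that owns summaries."""
    name: str
    summaries: dict = field(default_factory=dict)

    def summary(self, name):
        return self.summaries.setdefault(name, Summary(name))


def run_profiler(entity, summary_name, profiler, data, labels):
    """Run a profiler over data and store the result as a summary sample."""
    sample = SummarySample(profiler(data), dict(labels), time.time())
    entity.summary(summary_name).samples.append(sample)
    return sample


def alerter(summary, lower, upper, notify):
    """Notify when the latest sample falls outside [lower, upper]."""
    latest = summary.samples[-1]
    if not (lower <= latest.value <= upper):
        notify(f"{summary.name}={latest.value:.3f} outside [{lower}, {upper}]")
        return True
    return False


# Steps 1-3: define the entity and summary, then run profilers (made-up data).
entity = MonitoredEntity("churn-model")
run_profiler(entity, "prediction_mean", statistics.fmean,
             [0.2, 0.3, 0.25], {"source": "reference"})
run_profiler(entity, "prediction_mean", statistics.fmean,
             [0.7, 0.8, 0.9], {"source": "live"})

# Steps 4-5: define an alert and deliver it; a list stands in for a channel.
alerts = []
fired = alerter(entity.summary("prediction_mean"), 0.1, 0.5, alerts.append)
```

In a real deployment the alerter would run periodically and the callback would trigger a remediation such as a retrain or rollback, which is what closes the loop.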
How does it work?
[Architecture diagram. Components shown: Data/Model Pipelines; Model (Live); Model (Batch); Prediction Log; Ground truth; Time-series DB for statistical summaries; Remediation (Retrain, Rollback, Human loop)]
Summary
▴ Performance monitoring is no longer sufficient for the needs of modern ML systems
○ Model monitoring starts at the beginning of the pipeline and continues through production
○ Batch and live can be addressed in the same framework
▴ Knowing something is wrong is not enough; you need to know why
▴ Timely, actionable alerting is mandatory
▴ Building these tools in-house is difficult, error-prone, and expensive
▴ Spark is a fantastic tool to enable model monitoring
Monitor Your Models with Verta
▴ Visit monitoring.verta.ai today and see it in action
▴ Join our community
▴ Get more out of your models
▴ Get more out of your alerts
Thank you.
Cory A. Johannsen
Product Engineer, Verta Inc.
www.verta.ai

Why APM Is Not the Same As ML Monitoring