Why APM Is Not the Same As ML Monitoring

ML Monitoring is not APM
Cory A. Johannsen
Product Engineer, Verta Inc.
www.verta.ai

Agenda
▴ What is APM?
▴ What is ML monitoring?
▴ How ML monitoring and APM differ
▴ The unique needs of ML monitoring
▴ A very cool solution to model monitoring from Verta

About
https://www.verta.ai/product
- End-to-end MLOps platform for ML model delivery, operations and management
- Kubernetes-based operations stack for ML
- 23 years as a software engineer
- Embedded systems, enterprise software, SaaS
- 6 years in APM working at scale

What is APM?
▴ Application Performance Monitoring
▴ Metrics
￮ Name
￮ Value
￮ Labels
￮ Timestamp
▴ Visualization
▴ Alerting
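
To make the four metric fields concrete, here is a minimal sketch of one APM data point and the simplest possible alert check. The class name, metric name, label keys, and threshold are all hypothetical and chosen for illustration; they do not come from any particular APM product.

```python
from dataclasses import dataclass, field
import time


@dataclass
class MetricSample:
    """One APM data point: name, value, labels, timestamp."""
    name: str
    value: float
    labels: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


def breaches(sample: MetricSample, threshold: float) -> bool:
    """Simplest form of alerting: compare the value to a fixed threshold."""
    return sample.value > threshold


sample = MetricSample(
    name="http_request_latency_ms",                          # hypothetical metric name
    value=412.0,
    labels={"service": "checkout", "region": "us-east-1"},   # hypothetical labels
    timestamp=1_700_000_000.0,
)
print(breaches(sample, threshold=250.0))
```

Labels are what make visualization and alerting useful at scale: the same metric name can be sliced by service, region, or any other dimension.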

What do I care about monitoring in APM?
▴ Health
▴ Availability
▴ Performance
▴ Stability
▴ Notification

APM in practice
▴ Production operations
▴ Diagnostics and debugging
▴ Critical incident response

Ensuring model results are consistently of high quality
▴ Know when models are failing
▴ Quickly find the root cause
▴ Close the loop by fast recovery
*We refer to all latency, throughput etc. as model service health

Know when a model fails
▴ Without ground truth, model failures are challenging to detect
▴ Need to monitor complex statistical summaries
▴ Distributions, anomalies, missing values, quantiles, etc.
▴ Often model-specific

Quickly find the root cause
▴ A model is one part of an inference pipeline
▴ Need a global view of the pipeline jungle to see where the root issue may be

Close the loop
▴ Intelligent detection and alerting to pre-emptively identify issues and trigger remediations
▴ Execute re-trains, fallback models, and human intervention
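
As an illustration of the statistical summaries mentioned on this slide (missing values, quantiles), the sketch below profiles a single feature column using only the Python standard library. The function name `profile_column`, the choice of statistics, and the data are assumptions made for the example.

```python
import statistics


def profile_column(values):
    """Summarize one feature column; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4)  # quartile cut points
    missing = len(values) - len(present)
    return {
        "count": len(values),
        "missing": missing,
        "missing_rate": missing / len(values),
        "mean": statistics.fmean(present),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


ages = [34, 41, None, 29, 52, 47, None, 38, 45, 31]  # made-up data
summary = profile_column(ages)
for key, value in summary.items():
    print(key, value)
```

None of these numbers need ground truth labels, which is why summaries like this are a practical first signal that a model's inputs have changed.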

How APM and ML monitoring align
▴ Error rate, Throughput, Latency
￮ You need to know your production systems are operational
▴ Visualization
￮ You need to see change over time
▴ Alerting
￮ You need to know when something has gone wrong (and only when something has gone wrong)

What do you care about in ML Monitoring?
▴ Distribution
￮ Training versus test
￮ Iteration over iteration
￮ Live prediction
▴ Drift
￮ Change in distribution over time
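
One common way to quantify a change in distribution is the Population Stability Index (PSI), which compares binned fractions of a reference sample against a live sample. The slides do not name a specific drift measure, so PSI is a swapped-in example here, and the bin count, smoothing constant, and data are assumptions.

```python
import math


def histogram(values, edges):
    """Fraction of values per bin; values outside the edges fall into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(reference, live, bins=5, eps=1e-6):
    """Population Stability Index using equal-width bins taken from the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    score = 0.0
    for r, l in zip(histogram(reference, edges), histogram(live, edges)):
        r, l = max(r, eps), max(l, eps)  # avoid log(0) for empty bins
        score += (l - r) * math.log(l / r)
    return score


training = [i / 10 for i in range(100)]   # made-up reference data
shifted = [v + 3.0 for v in training]     # same shape, shifted upward
print(psi(training, training))
print(psi(training, shifted))
```

Identical distributions score zero and the score grows as the live data moves away from the reference. Where to draw the line is a domain decision.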

How APM and ML monitoring differ
▴ Error Rate, Throughput, Latency
￮ Necessary, no longer sufficient
▴ Not all work is production work
￮ ML monitoring happens from the beginning of the pipeline
▴ APM can tell you what is wrong
￮ ML monitoring is about understanding why

What makes ML monitoring unique
▴ Quantitative analysis of model performance
￮ Information you can use
▴ Controlled comparison of distributions
￮ Repeatable
￮ Reliable
￮ Consistent
▴ Alerting on meaningful deviation
￮ Actionable
￮ Timely
￮ Accurate

Only you know the shape of your data
▴ Every model and pipeline is different and specialized
￮ You built them, you understand them
▴ You know what metrics and distributions are valuable
￮ This is your model, you know the data and processes that created it
▴ You know the expected distributions
￮ You can determine whether the behavior is correct

Only you know how to measure change
▴ Compare to reference set
￮ Training, test, golden data set
▴ Compare to a baseline
￮ Calculate a baseline from your data or production systems
▴ Compare to other
￮ Use a comparison that makes sense in your domain

Only you know when a change matters
▴ You know your model and tolerances
▴ You know when a deviation is significant (or not!)
▴ You know when these conditions need to change

Verta understands model monitoring
▴ Designed for your workflows
▴ Easy integration to capture your monitoring data
▴ Visualize and understand your metrics, distributions, and drift
▴ Get alerted when you should - not otherwise

Introducing a generalized framework for Model Monitoring

Concepts
▴ Monitored Entity: A reference name (e.g. model or pipeline) that you want to monitor
▴ Profiler: A function that computes statistics about your data
▴ Summary: A collection of statistics about your data (output of profiler)
￮ Samples: instance of a summary, i.e., a statistic
￮ Labels: key-values attached to summary samples. Used for rich filtering and aggregation
▴ Alerter: Triggered periodically, it can talk with the Verta API to fetch information about summaries and identify if they look wrong
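
The concepts above can be modeled in a few lines of plain Python. This is an illustrative sketch only: the class and function names are hypothetical, it does not use or represent the Verta client API, and the alerter here reads from an in-memory list rather than calling a service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SummarySample:
    summary: str                       # name of the summary this sample belongs to
    value: float                       # the statistic itself
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class MonitoredEntity:
    name: str                          # reference name, e.g. a model or pipeline
    samples: List[SummarySample] = field(default_factory=list)

    def select(self, summary: str, **labels: str) -> List[SummarySample]:
        """Filter samples by summary name and by label values."""
        return [
            s for s in self.samples
            if s.summary == summary
            and all(s.labels.get(k) == v for k, v in labels.items())
        ]


def missing_rate_profiler(rows: List[dict], column: str) -> float:
    """A profiler: a function that computes a statistic about the data."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def alerter(entity: MonitoredEntity, summary: str, limit: float, **labels: str):
    """An alerter: fetch samples and return the ones that look wrong."""
    return [s for s in entity.select(summary, **labels) if s.value > limit]


entity = MonitoredEntity(name="churn-model")  # hypothetical entity name
batches = {  # made-up rows
    "training": [{"age": 31}, {"age": 45}, {"age": 29}, {"age": 52}],
    "live": [{"age": None}, {"age": 40}, {"age": None}, {"age": 33}],
}
for source, rows in batches.items():
    value = missing_rate_profiler(rows, "age")
    entity.samples.append(SummarySample("age_missing_rate", value, {"source": source}))

wrong = alerter(entity, "age_missing_rate", limit=0.25)
print([(s.labels["source"], s.value) for s in wrong])
```

The `source` label lets the same summary be compared across training and live data, which is the kind of filtering and aggregation the Labels concept describes.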

How does it work?
1. Define monitored entity: the entity to be monitored (e.g., model, data, pipeline)
2. Define summaries to monitor for the entity
3. Run profilers (manually or automatically) to produce summary samples
4. View samples, define alerts
5. Get alerted (e.g. via Slack)
6. Close the loop!
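
The six steps can be walked through end to end with a small, self-contained sketch. Everything named here (the entity, the summaries, the limit, the dates, and the remediation message) is hypothetical, and nothing is sent anywhere. The payload uses the `{"text": ...}` JSON shape that Slack incoming webhooks accept, but the code only builds the string.

```python
import json
import statistics

# 1. Define the monitored entity (hypothetical name).
entity = "fraud-model"

# 2. Define the summaries to monitor: summary name -> profiler function.
summaries = {"score_mean": statistics.fmean, "score_max": max}

# 3. Run the profilers over each batch to produce summary samples.
batches = {  # made-up prediction scores keyed by day
    "2021-05-01": [0.10, 0.12, 0.11, 0.09],
    "2021-05-02": [0.11, 0.10, 0.13, 0.10],
    "2021-05-03": [0.45, 0.52, 0.48, 0.50],
}
samples = [
    {"entity": entity, "summary": name, "day": day, "value": profiler(scores)}
    for day, scores in batches.items()
    for name, profiler in summaries.items()
]

# 4. View samples and define alerts: summary name -> upper limit (hypothetical).
alerts = {"score_mean": 0.25}


def check(samples, alerts):
    """Return the samples whose value exceeds the limit set for their summary."""
    return [
        s for s in samples
        if s["summary"] in alerts and s["value"] > alerts[s["summary"]]
    ]


# 5. Get alerted: build a message body that could be posted to a chat webhook.
def slack_payload(sample):
    text = f"{sample['entity']}: {sample['summary']}={sample['value']:.3f} on {sample['day']}"
    return json.dumps({"text": text})


# 6. Close the loop: decide on a remediation (placeholder action only).
def remediate(sample):
    return f"retrain requested for {sample['entity']}"


fired = check(samples, alerts)
for s in fired:
    print(slack_payload(s))
    print(remediate(s))
```

Only the third day's mean score crosses the limit, so a single alert fires and a single remediation is requested.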

How does it work? (architecture diagram)
[Diagram labels: Data/Model Pipelines, Model (Live), Model (Batch), Prediction Log, Ground truth; Time-series DB for statistical summaries; Remediation: Retrain, Rollback, Human loop]

Summary
▴ Performance monitoring is no longer sufficient for the needs of modern ML systems
￮ Model monitoring starts at the beginning of the pipeline and continues through production
￮ Batch and live can be addressed in the same framework
▴ Knowing something is wrong is not enough; you need to know why
▴ Timely, actionable alerting is mandatory
▴ Building these tools on-site is difficult, error-prone, and expensive
▴ Spark is a fantastic tool to enable model monitoring

Monitor Your Models with Verta
▴ Visit monitoring.verta.ai today and see it in action
▴ Join our community
▴ Get more out of your models
▴ Get more out of your alerts

Thank you.
