C2 - W1 Mlopssadsa
DeepLearning.AI makes these slides available for educational purposes. You may not use or
distribute these slides for commercial purposes. You may make copies of these slides and
use or distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides.
Welcome
The importance of data
“Data is the hardest part of ML and the most important piece to get right...
Broken data is the most common cause of problems in production ML systems”
- Scaling Machine Learning at Uber with Michelangelo - Uber
“No other activity in the machine learning life cycle has a higher return on
investment than improving the data a model has access to.”
- Feast: Bridging ML Models and Data - Gojek
Introduction to Machine Learning Engineering for Production
Overview
Outline
● Challenges in production ML
Traditional ML modeling
Data Training
Yields
Evaluation
Production ML systems require so much more
● Configuration
● Data Collection
● Data Verification
● Machine Resource Management
● Serving Infrastructure
● Monitoring
● Analysis Tools
● ML Code
                      Academic/Research ML           Production ML
Priority for design   Highest overall accuracy       Fast inference, good interpretability
Model training        Optimal tuning and training    Continuously assess and retrain
Machine learning development + Modern software development
Managing the entire life cycle of data
● Labeling
● Feature space coverage
● Minimal dimensionality
● Maximum predictive data
● Fairness
● Rare conditions
Modern software development
Accounts for:
● Scalability
● Extensibility
● Configuration
● Consistency & reproducibility
● Safety & security
● Modularity
● Testability
● Monitoring
● Best practices
Production machine learning system
New data
Define project → Define data and establish baseline → Label and organize data → Select and train model → Perform error analysis → Deploy in production → Monitor and maintain system
Challenges in production grade ML
ML Pipelines
Outline
● ML Pipelines
Infrastructure for automating, monitoring, and maintaining model training and deployment
Production ML infrastructure
Directed acyclic graphs
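A pipeline organized as a directed acyclic graph can be sketched with the Python standard library. The step names below (ingest, validate, and so on) are hypothetical placeholders, not the components of any particular framework; the sketch only shows how dependencies determine a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE: Dict[str, Set[str]] = {
    "ingest": set(),
    "statistics": {"ingest"},
    "validate": {"statistics"},
    "transform": {"ingest", "validate"},
    "train": {"transform"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def execution_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return one ordering in which every step runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(PIPELINE)
print(order)
```

If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is one reason pipelines are kept acyclic.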
● Configuration
● Data Collection
● Data Verification
● Machine Resource Management
● Serving Infrastructure
● Monitoring
● Analysis Tools
● ML Code
Example
Bulk Inference
Validator
TFX Hello World
Key points
Importance of Data
Outline
ML: Data is a first class citizen
● Software 1.0
○ Explicit instructions to the computer
● Software 2.0
● Data collection
● Data ingestion
● Data formatting
● Feature engineering
● Feature extraction
Data collection and monitoring
● Downtime
● Errors
● Distribution shifts
● Data failure
● Service failure
Key Points
Example Application:
Suggesting Runs
Example application: Suggesting runs
Users Runners
FEATURES
LABELS
X8KGF | Seattle Oktoberfest 5k | 00:35:40 | 0 ft | High
● Inconsistent formatting
○ Is zero “0”, “0.0”, or an indicator of a missing measurement?
● Runner demographics
● Time of day
● Run completion rate
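The inconsistent-formatting issue above (is a zero really zero, or a missing measurement?) can be handled at ingestion time. The following is a minimal sketch, assuming elevation arrives as free-form strings such as "0", "0.0", or "0 ft", and that missing values are marked explicitly. The marker set and function name are hypothetical. Whether a literal zero should itself be treated as missing is a data-specific decision this sketch does not make.

```python
import math
from typing import Optional

# Hypothetical set of strings treated as "no measurement recorded".
MISSING_MARKERS = {"", "na", "n/a", "none", "null", "-"}


def parse_elevation_ft(raw: Optional[str]) -> Optional[float]:
    """Parse an elevation string such as "0", "0.0", or "0 ft" into feet.

    Returns None when the value is absent or unparseable, so that a true
    zero and a missing measurement stay distinguishable downstream.
    """
    if raw is None:
        return None
    text = raw.strip().lower()
    if text in MISSING_MARKERS:
        return None
    # Drop a trailing unit if present.
    if text.endswith("ft"):
        text = text[:-2].strip()
    try:
        value = float(text)
    except ValueError:
        return None
    # Reject "nan" and "inf", which float() accepts but are not measurements.
    return value if math.isfinite(value) else None


print(parse_elevation_ft("0 ft"))  # a real zero
print(parse_elevation_ft("N/A"))   # a missing measurement
```

Keeping missing values as None (rather than 0) lets later validation count them separately from genuine flat-course runs.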
Responsible Data: Security, Privacy & Fairness
Outline
● Data Sourcing
Web scraping
Build synthetic dataset
Collect live data
Data security and privacy
● Representational harm
● Opportunity denial
● Disproportionate product failure
● Harm by disadvantage
Commit to fairness
Raters
● Generalists
● Subject matter experts
● Your users
Kinds of problems:
World changes
• Styles change
• Scope and processes change
• Competitors change
• Business expands to other geos
Sudden problems
● Labeling
○ Curated datasets
○ Crowd-based
Harder problems
● Labeling
○ Direct feedback
○ Crowd-based
Really hard problems
● Labeling
○ Direct feedback
○ Weak supervision
Key points
Variety of Methods
● Features from inference requests
● Labels from monitoring predictions
● Join results with inference requests
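Joining logged inference requests with labels observed later can be sketched as a key-based join. This is an illustration under assumptions: the request ids, feature names, and the meaning of the label (1 = the suggested run was completed) are all hypothetical, and a real system would read both sides from logs rather than in-memory dictionaries.

```python
from typing import Dict, List, Tuple


def join_feedback(
    requests: Dict[str, dict],
    outcomes: Dict[str, int],
) -> List[Tuple[dict, int]]:
    """Pair logged request features with later-observed labels by request id.

    Requests with no observed outcome are skipped: they have no label yet.
    """
    examples = []
    for request_id, features in requests.items():
        if request_id in outcomes:
            examples.append((features, outcomes[request_id]))
    return examples


# Hypothetical logs: features captured at inference time...
requests = {
    "r1": {"distance_km": 5.0, "hour": 7},
    "r2": {"distance_km": 10.0, "hour": 18},
    "r3": {"distance_km": 3.0, "hour": 12},
}
# ...and labels observed later by monitoring what happened to each prediction.
outcomes = {"r1": 1, "r3": 0}

training_examples = join_feedback(requests, outcomes)
print(training_examples)
```

Request "r2" produces no training example because no outcome has been observed for it yet.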
Logstash
Free and open source data processing pipeline
● Ingests data from a multitude of sources
● Transforms it
● Sends it to your favorite "stash."
Fluentd
Open source data collector
Unify the data collection and consumption
Process feedback - Cloud log analytics
AWS ElasticSearch
Cloud Log Analysis
Azure Monitor
Human labeling
● More labels
● Pure supervised learning
Human labeling - Disadvantages
Slow
Expensive
● Data issues
○ Drift and skew
■ Data and concept drift
■ Schema skew
■ Distribution skew
Drift
Changes in data over time, such as data collected once a day
Skew
Difference between two static versions, or different sources, such as training set and serving set
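The distinction between drift and skew can be illustrated with one comparison function applied two ways. This is a minimal sketch, assuming a single numeric feature and using the relative change in its mean as the statistic. That choice, the function names, and the thresholds are illustrative assumptions, not a prescribed method.

```python
from statistics import mean
from typing import List, Sequence


def relative_mean_change(reference: Sequence[float], current: Sequence[float]) -> float:
    """Absolute change in mean, relative to the magnitude of the reference mean."""
    ref_mean = mean(reference)
    cur_mean = mean(current)
    denom = abs(ref_mean) if ref_mean != 0 else 1.0
    return abs(cur_mean - ref_mean) / denom


def detect_drift(daily_batches: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Drift: compare each later daily batch against the first one (change over time)."""
    baseline = daily_batches[0]
    return [
        day
        for day, batch in enumerate(daily_batches[1:], start=1)
        if relative_mean_change(baseline, batch) > threshold
    ]


def detect_skew(training: Sequence[float], serving: Sequence[float], threshold: float) -> bool:
    """Skew: compare two static datasets, such as training set and serving set."""
    return relative_mean_change(training, serving) > threshold


# Hypothetical daily batches of one feature (e.g. run distance in km).
daily = [[5.0, 5.2, 4.8], [5.1, 4.9, 5.0], [6.5, 6.4, 6.6]]
drifted_days = detect_drift(daily, threshold=0.1)
skewed = detect_skew([5.0, 5.0, 5.0], [8.0, 8.0], threshold=0.1)
print(drifted_days, skewed)
```

The comparison is the same in both cases; what differs is what is being compared: successive snapshots of one source (drift) versus two fixed datasets (skew).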
Typical ML pipeline
During training: Data → Batch processing
During serving: Request → Real-time processing
Model decay: Data drift
Performance decay: Concept drift
(Figure labels: Training, Serving, Update)
Detecting data issues
Marginal
Concept shift
Skew detection workflow
Validate statistics
Detect anomalies
TensorFlow
Data Validation
TensorFlow Data Validation (TFDV)
1. Schema skew
2. Feature skew
● Set a threshold to receive warnings
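Setting a threshold to receive skew warnings can be sketched without any library. The example below assumes a categorical feature and uses the L-infinity distance (the largest absolute difference in category frequency) between training and serving data. It does not call TFDV; the function names, data, and threshold value are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def category_frequencies(values: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each category in the input."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def linf_distance(train_values: Iterable[str], serving_values: Iterable[str]) -> float:
    """Largest absolute frequency difference over all categories seen in either set."""
    p = category_frequencies(train_values)
    q = category_frequencies(serving_values)
    categories = set(p) | set(q)
    return max((abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories), default=0.0)


def skew_warning(train_values, serving_values, threshold: float) -> bool:
    """True when the distance exceeds the configured threshold."""
    return linf_distance(train_values, serving_values) > threshold


# Hypothetical race-type feature: 80/20 in training, 50/50 in serving.
train = ["5k"] * 8 + ["10k"] * 2
serving = ["5k"] * 5 + ["10k"] * 5

distance = linf_distance(train, serving)
print(distance, skew_warning(train, serving, threshold=0.01))
```

A lower threshold produces warnings on smaller differences; the right value depends on how much variation between training and serving data is expected for the feature.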
Schema skew