CT1 - MLOps
Manaranjan Pradhan
Key Objectives
• Understand key challenges at every step in building ML Systems
• Design, Development, Evaluation, Deployment and Monitoring stages
• Equip you with tools, techniques and best practices to deal with these
challenges
• Design, develop and deploy end-to-end ML systems
• Learn some of the Industry Standard tools and platforms
• Focus is on practical lessons.
Exams and Grading
• Group Assignment: 50% weightage (Coding Scheme: 3N-a)
• Final Exam: 50% weightage (Coding Scheme: 4N)
Projects
• The project is to be completed by a team of up to 5 people.
• The project will be hosted on GitHub. You can showcase this as an accomplishment.
• Plan to write a blog post as well.
• This will be a good learning experience!
Prepare for the class!
• Install and set up a Conda environment
• Set up Google Colab
• Create a folder on your Google Drive where you can store all the code and materials I send you before each class.
• Sign up for Azure
• You should get 100 USD of student credit if you sign up using your student (ISB) email id
Why MLOps and ML Systems Design
Netflix 1 Million Dollar Challenge!
Netflix 1M USD Prize
• https://www.wired.com/2012/04/netflix-prize-costs/
• https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429
Netflix Blog
https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
ML Lifecycle
Use Case Identification and Problem Formulation → Data Collection and Preparation → Exploratory Analysis & Feature Engineering → Building Machine Learning Models → Model Validation and Evaluation → Model Deployment → Model Monitoring
First model development and continuous model updates are iterative steps across these stages.
Lifecycle – Engineering Skills are Leveraged
The same lifecycle stages apply; engineering effort and skills are concentrated as follows:
• Data collection and preparation can take around 60% of the total effort.
• Exploratory analysis and feature engineering need some amount of statistics and business knowledge.
• Building models is being automated; AutoML is an initiative in that direction. You can develop frameworks for selecting models, or reuse or repurpose an existing model (transfer learning).
Machine Learning Algorithms
Machine Learning: the ability to learn without being explicitly programmed.
• Supervised: the features (or factors) and the outcome are known in the historical data.
  • Classification: predicting labels which are categorical, for example customer churn or fraud detection.
  • Regression: predicting labels which are continuous in nature, for example sales volume or stock price prediction.
  • Forecasting: time series forecasting.
  • Recommender Systems: recommending products or services.
• Unsupervised: only the features are known; there is no ground truth.
  • Clustering: grouping related items, for example finding customer groups.
  • Dimensionality Reduction: projecting higher dimensional data onto lower dimensions, for example images.
What exactly is a ML System?
• The learning system is part of development: Continuous Integration (CI) and Continuous Delivery (CD).
• Code covers exploration, preparation, modelling and evaluation; the data is split into train and test sets.
• Several candidate models are trained, for example M1 Logistic Regression, M2 Decision Tree, M3 KNN, M4 Random Forest.
• The best model (say M3) is transferred to the prediction / inference system, which handles incoming requests by applying the transformations plus the algorithm and returning predictions.
• The prediction system only needs the model and its required parameters, e.g. y = B0 + B1*X1 + ....
Serialization/Deserialization Technique:
We take a model and write it to a file in either pickle format or ONNX format (Open Neural Network Exchange), and this file needs to be version controlled. We then load the file into memory in the inference system (deserialization), and when we get an input we run the model and return prediction values.
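A minimal sketch of this serialize/deserialize step using Python's pickle module and a scikit-learn model; the file name and toy data are placeholders for illustration.

```python
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Learning system: train a candidate model on toy data
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
model = LogisticRegression().fit(X, y)

# Serialization: write the fitted model to a version-controlled artifact
with open("model_v1.pkl", "wb") as f:
    pickle.dump(model, f)

# Prediction / inference system: deserialize and serve predictions
with open("model_v1.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X[:3]))  # predictions for incoming requests
```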
Different Types of ML Systems
• Batch prediction
  • Frequency: periodical (e.g. every 4 hours)
  • Useful for: processing accumulated data when you don't need immediate results (e.g. recommendation systems)
  • Optimized for: high throughput (quickly execute many tasks)
  • Examples: TripAdvisor hotel ranking, Netflix recommendations, customer churn analysis
• Online prediction
  • Frequency: as soon as requests come in
  • Useful for: when the prediction result is needed immediately
  • Optimized for: low latency (the delay per request is minimized)
  • Examples: Google Assistant speech recognition, fraud detection
https://stanford-cs329s.github.io/
MLOps Process Frameworks
Business Understanding
Problem Formulation
Key considerations at problem formulation:
• Accuracy or business metrics to measure
• Cost or risk of model failure
• System constraints
• Interpretability or explainability
• Bias and fairness
• Regulatory requirements
All machine learning projects should be driven by a single metric.
Example: will a customer default on a loan or not?
Cost of Model Failure
From the confusion matrix (TP, FN, FP, TN) for the loan-default example:
Recall = TP / (TP + FN)
Minimize Total Cost = (C1 x FP + C2 x FN)
where,
C1 = cost of each false positive
C2 = cost of each false negative
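A small sketch of this cost calculation; the labels, predictions, and the unit costs C1 and C2 are made-up numbers for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for the loan-default example
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

c1 = 100   # assumed cost of each false positive
c2 = 500   # assumed cost of each false negative

recall = tp / (tp + fn)
total_cost = c1 * fp + c2 * fn

print(f"Recall: {recall:.2f}, Total cost: {total_cost}")
```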
Cost of Model Failure
Spam filtering example: good mails that end up in the spam box are false positives; spam mails that still appear in the inbox are false negatives.
Precision = TP / (TP + FP)
Minimizing false positives is more important here.
For employee attrition (leave or not leave), false negatives are more important, so use recall.
The risk or cost of model failure
should be captured at the time of
problem formulation.
Business Metrics
The performance of the recommender system is measured by the take-rate:
take-rate = (number of quality plays) / (number of recommendations the user sees)
https://dl.acm.org/doi/pdf/10.1145/2843948
Deployment Constraints
• Latency
• Throughput
The latency requirement can impact the choice of models. The inference / prediction system receives input data or features, the model makes decisions, and a prediction is returned.
Interpretability or Explainability
• Model objective
• Prediction (black box): e.g., an attrition model answering "What is the likelihood of this employee leaving?"
• Inference: answering "What are the important factors influencing the employee's decision to leave?"
• Inference is key to creating strategy
• Black box vs. glass box models
Bias and Fairness
https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G
https://fortune.com/2018/10/10/amazon-ai-recruitment-bias-women-sexist/
Problem Formulation
• Inference or Prediction
• Local or Global Inference
• Model Risk Assessment
• Evaluation Metrics
• Model Interpretability or Explainability
• Bias and Fairness Requirements
• Compliance Requirements
• System Constraints
Data Understanding
What do Practitioners Say?
1. Data + Schema
2. Storage Efficiency
3. Read Latency
Data format
Common Data Formats
• Plain-text CSV - a good old friend of a data scientist
• Pickle - Python's way to serialize things
• HDF5 - a file format designed to store and organize large amounts of data
• https://en.wikipedia.org/wiki/Hierarchical_Data_Format
• Feather - a fast, lightweight, and easy-to-use binary file format for storing
data frames
• https://github.com/wesm/feather
• Parquet - Apache Hadoop's columnar storage format
• https://parquet.apache.org/
• Good for reading a subset of the columns. Mostly used as data lake or data
warehouse storage format.
CSV and JSON are semi-structured data
Pros:
• Widely used
• Plain text file – Can open it in any computer, readable by humans
• Can be read from and written to by most data software
Cons:
• Not the most efficient way to store or access
• No formal standard, so there is room for user interpretation on how to handle edge cases
• No Schema attached
Note:
• A great default option for most use cases
• Works well for < 100 MB of data
Parquet or ORC: Data + Schema, columnar storage formats
Pros:
• Very fast
• Naturally understands all dtypes used by pandas, including multi-index DataFrames
• Very common in “big data” systems like Hadoop or Spark
• Supports various compression algorithms
Cons:
• Binary storage format that is not human-readable
Note:
• > 100 MB data
• When many systems are accessing it (BI, ML Platforms, any other analysis tool)
• When many columns are present, but you need to access only a subset of them
Row Oriented Vs. Columnar Format
Pickle
Pros:
• Python-native serialization format. Highly optimized for Python reads and writes
Cons:
• No other language or system can understand this.
Note:
• Use only when you know that only Python systems will read this file
• Can be used as a staging file during pipelines
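A quick sketch (using pandas, with made-up column names and file names) of saving the same DataFrame in the three formats discussed above; Parquet requires the optional pyarrow or fastparquet dependency.

```python
import pandas as pd

# Hypothetical data frame for illustration
df = pd.DataFrame({
    "customer_id": range(1000),
    "region": ["north", "south"] * 500,
    "monthly_spend": [100.0 + i for i in range(1000)],
})

# Row-oriented, human-readable, no schema
df.to_csv("customers.csv", index=False)

# Columnar, schema-aware, compressed (needs pyarrow or fastparquet installed)
df.to_parquet("customers.parquet", index=False)

# Python-only binary serialization, keeps all pandas dtypes
df.to_pickle("customers.pkl")

# Reading back just a subset of columns is cheap with Parquet
subset = pd.read_parquet("customers.parquet", columns=["customer_id", "monthly_spend"])
print(subset.head())
```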
Data Profiling
• Missing values
  • How much is missing? If more than ~20% of a column is missing, consider dropping that column.
  • Imputation options: a default value, mean/median, or model-based imputation.
• Outliers (e.g. extreme outliers)
• Bad data quality
• Data sampling errors
https://careersatdoordash.com/blog/five-common-data-quality-gotchas-in-machine-learning-and-how-
to-detect-them-quickly/
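A minimal profiling sketch in pandas following the notes above; the input file, column handling, and the 20% threshold are illustrative assumptions.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input file

# How much is missing per column?
missing_ratio = df.isna().mean()
print(missing_ratio.sort_values(ascending=False))

# Drop columns where more than 20% of the values are missing
too_sparse = missing_ratio[missing_ratio > 0.20].index
df = df.drop(columns=too_sparse)

# Simple imputation: median for numeric columns, a default value for the rest
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna("unknown")
```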
Train-Test Split
Train Test
• Is not appropriate for model search
• Experimenting with multiple models (different algorithms)
• Searching for best model (same algorithm) with optimal
hyperparameters
• Test data is used multiple times for model validation
• Information gets leaked into the modelling process
• The test accuracy of the final model is optimistically biased
Train-Validation-Test Split
Train Val Test
• Is appropriate for model search
• Experimenting with multiple models (different algorithms)
• Searching for best model (same algorithm) with optimal
hyperparameters
• The validation set remains static
• Information gets leaked into the modelling process
• Validation accuracy may be optimistically biased
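A short sketch of creating a train/validation/test split with scikit-learn by splitting twice; the 70/15/15 proportions and toy data are an illustrative choice.

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# First carve out the test set, then split the remainder into train and validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.15 / 0.85, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 70% / 15% / 15%
```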
To find the best model, we use cross-validation (CV).
K-Fold Cross Validation Split
Divide the training data into K folds (5-fold here), hold out one fold, train on the remaining folds (e.g. folds 1-4), and evaluate on the held-out fold (fold 5). Repeat so that each fold serves as the holdout once, and average the accuracies.
The same procedure can be run with a different model (e.g. a decision tree), and the averaged accuracies of the two models compared; the better model is selected.
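A sketch of this model comparison with scikit-learn's cross_val_score, using logistic regression and a decision tree as the two candidates on toy data.

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 5-fold cross-validation for each candidate model
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Decision Tree", DecisionTreeClassifier(random_state=42))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```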
Class Imbalance
Why it is hard:
● Not enough signal to learn about rare classes
● Statistically speaking, predicting the majority label has a higher chance of being right
● Imbalance often comes with differences in the cost of wrong predictions
Typical examples:
● Fraud detection
● Spam detection
● Disease screening
● Churn prediction
Class imbalance solution: Resampling
• Undersampling: remove samples from the majority class. Can cause loss of information.
• Oversampling: add more examples of the minority class. Can cause overfitting.
Oversampling technique: SMOTE (Synthetic Minority Oversampling Technique)
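A minimal SMOTE sketch using the imbalanced-learn library (an assumption: the imblearn package is installed); SMOTE is applied only to the training split to avoid leakage.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 5% positive class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

print("Before SMOTE:", Counter(y_train))

# Synthesize new minority-class samples on the training data only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

print("After SMOTE:", Counter(y_res))
```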
Types of data leakage
● Data Leakage
○ Premature featurization: creating features on the entire dataset instead of just the training data
■ E.g. create n-gram counts/vocabulary from train + test sets
○ Oversampling before splits
■ Train splits might contain test samples
○ Time leakage
■ Randomly splitting data instead of temporal split can cause training data to be able to see
the future
○ Group leakage
■ A patient has 3 CT scans: 2 in train, 1 in test.
How to avoid leakage?
● Check for duplication between train and valid/test splits
● Temporal split data (if possible)
● Use only train splits for feature engineering
● Train model on subset of features
○ If performance very high on a subset, either very good set of features or
leakage!
● Monitor model performance as more features are added
○ Sudden increase: either a very good feature or leakage!
● Involve subject matter experts in the process
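A small sketch of "use only train splits for feature engineering": the scaler is fit on the training data and merely applied to the test data, so no test-set statistics leak into training.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics computed from train only
X_test_scaled = scaler.transform(X_test)        # test data is only transformed, never fit
```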
Keeping track of all things!
In traditional software development, only the code is version controlled. In ML, Data + Pipeline/Code + Model all need to be tracked:
• Data: any change in the samples will affect the performance of the model.
• Pipeline/Code: any change in imputation, scaling and encoding techniques will change the performance of the model.
• Model: any change in hyperparameters will affect the performance of the model.
Model Development
Do we always need to build a ML Model?
Model Baselining
https://eugeneyan.com/writing/first-rule-of-ml/
Google’s Rule for Machine Learning
https://developers.google.com/machine-learning/guides/rules-of-ml
Before building a ML model start with a baseline model
Baselining
• Create a system with if/else rules from heuristics
• Build a simple ML model (e.g., linear regression) first (see the sketch after this list)
• Build a system with regex (hand-crafted regular expressions) for classifying text data
• Benefits of creating a heuristics system
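A sketch of baselining with scikit-learn: a majority-class DummyClassifier and a plain logistic regression set the scores that any more complex model should beat. The toy data is illustrative.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline 1: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("Majority-class baseline accuracy:", baseline.score(X_test, y_test))

# Baseline 2: a simple linear model
simple_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Logistic regression accuracy:", simple_model.score(X_test, y_test))
```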
When you are forced to build ML systems?
Which model needs to be built?
Which model to select?
Highly accurate but complex models are often called black box models.
Accuracy Vs. Explainability
https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G
https://www.bbc.com/news/technology-45809919
• White box Vs. Black box models
• Explainable AI (XAI)
https://fortune.com/2018/10/10/amazon-ai-recruitment-bias-women-sexist/
• Bias and Fairness
Model Development is messy!
• Run a number of experiments to refine your model
• Experiments may involve
• Different Transformations
• Different Models
• Different Hyperparameters
• Easy to lose track of code, hyperparameters, and artifacts
• Fail to reproduce experiments (reproducibility)
Building and deploying a Pipeline
• Preprocessors: one-hot encode (OHE) the categorical variables and scale the numerical features
• Define the preprocessor, define the model, and generate the pipeline that chains them (see the sketch after this list)
• Train the pipeline, test it, then deploy it to the prediction system
• Experiment tracking tools: MLflow (for local use), Neptune.ai, Weights and Biases (cloud based, widely used): https://wandb.ai/site/
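A minimal sketch of such a pipeline with scikit-learn; the column names and toy data are hypothetical placeholders.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical training data
df = pd.DataFrame({
    "region": ["north", "south", "east", "west"] * 25,
    "monthly_spend": [float(i) for i in range(100)],
    "churn": [0, 1] * 50,
})
X, y = df[["region", "monthly_spend"]], df["churn"]

# Preprocessor: OHE the categorical column, scale the numerical column
preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ("numerical", StandardScaler(), ["monthly_spend"]),
])

# Pipeline = preprocessor + model, trained and deployed as one artifact
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
print(pipeline.predict(X.head()))
```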
Experiment Tracking
• Manually track the results of all model runs in a spreadsheet
• Use experiment tracking tools
• Weights and Biases
• MLFlow
• Neptune.ai
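A small sketch of experiment tracking with MLflow's Python API; the experiment name, parameters, and metric values are illustrative placeholders.

```python
import mlflow

mlflow.set_experiment("churn-model-experiments")  # hypothetical experiment name

with mlflow.start_run(run_name="logreg-baseline"):
    # Log the hyperparameters used for this run
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("C", 1.0)

    # ... train and evaluate the model here ...

    # Log the resulting metrics so runs can be compared later
    mlflow.log_metric("val_accuracy", 0.87)   # placeholder value
    mlflow.log_metric("val_auc", 0.91)        # placeholder value
```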
AutoML
• Finding the right model can be time consuming
• Time to market can be critical
• Unavailability of expertise in enterprises
• Benefits of using AutoML:
• Improve efficiency by automatically running repetitive tasks. This allows data
scientists to focus more on problems (like data) instead of models.
• Automated ML pipelines also help avoid potential errors caused by manual
work.
• AutoML is a big step towards the democratization of machine learning and
allows everyone to use ML features.
AutoML Frameworks
AutoML Frameworks
• Two types of frameworks
• Searches possible models from traditional ML algorithms
• Linear, SVM, KNN, Bagging, Boosting, Naïve Bayes etc.
• Does hyperparameter tuning
• Works with mostly structured data
• Neural Network Search
• Searches for neural network architectures
• Number of neurons and layers
• Works with structured and unstructured data
Extra:
Ensembling Techniques:
1. Bagging
2. Boosting
3. Stacking: build several different models, take the outcome from each model, pass those outcomes through a meta model, and get the final prediction. This approach is popular nowadays.
AutoML outputs a leaderboard that lists all the models it has tried and ranks them.
Leaderboards
AutoML Frameworks
Popular
• https://isg.beel.org/blog/2020/04/09/list-of-automl-tools-and-software-libraries/
• https://medium.com/swlh/8-automl-libraries-to-automate-machine-learning-pipeline-3da0af08f636
Searching parameters (see the H2O AutoML sketch after this list)
• Max run time – limits the time to experiment
• Max models – limits the number of experiments
• Stopping metric and tolerance: e.g. MSE, AUC, R_square <= 0.85
• Sorting metric – for leaderboard creation
• Exclude algos, e.g. ["GLM", "DeepLearning"]
• Include algos, e.g. ["GBM", "XGBoost", "DRF"]
• Preprocessing
  • for example scaling, various encodings (OHE, Target etc.) – not many frameworks support this.
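A hedged sketch of these search parameters using H2O AutoML; it assumes the h2o package is installed, a hypothetical file train.csv, and a target column named "churn".

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Hypothetical training data loaded into an H2OFrame
train = h2o.import_file("train.csv")
target = "churn"                              # assumed target column
features = [c for c in train.columns if c != target]
train[target] = train[target].asfactor()      # treat as a classification target

aml = H2OAutoML(
    max_runtime_secs=600,                     # max run time
    max_models=20,                            # max models
    stopping_metric="AUC",                    # stopping metric
    sort_metric="AUC",                        # sorting metric for the leaderboard
    exclude_algos=["GLM", "DeepLearning"],    # excluded algorithms
    seed=42,
)
aml.train(x=features, y=target, training_frame=train)

# Leaderboard: all models tried, ranked by the sort metric
print(aml.leaderboard.head())
```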
How to use AutoML?
• Should be used as a guidance tool
• You may not want to take the suggested model directly to production
• Gives guidance on
• What models can be used
• What feature engineering can be used (though this is not a replacement of
actual feature engineering based on domain knowledge)
• Can be an indicator of what accuracy can be expected