
CSE 234

Data Systems for Machine Learning

Arun Kumar

Topic 3: ML Deployment, MLOps, and LLMOps

Chapter 8.5 of MLSys book

1
ML Deployment in the Lifecycle

[Figure: ML lifecycle stages: Data acquisition → Data preparation → Feature Engineering → Model Selection → Training & Inference → Serving → Monitoring]

2
ML Deployment in the Lifecycle

3
Outline

❖ Offline ML Deployment
❖ MLOps:
❖ Online Prediction Serving
❖ Monitoring and Versioning
❖ Federated ML
❖ LLMOps

4
Offline ML Deployment

❖ Given: A trained prediction function f(); a set of (unlabeled) data examples
❖ Goal: Apply inference with f() to all examples efficiently
❖ Key metrics: Throughput, cost, latency
❖ Historically, offline was the most common scenario
❖ Still is at most enterprises, healthcare, academia
❖ Typically once a day / week / month / quarter!
❖ Aka model scoring in some settings

5
Offline ML Deployment: Systems

❖ Not particularly challenging in most applications


❖ All ML systems support offline batch inference by default

[Figure: tool logos by category: Disk-based files; Layered on RDBMS/Spark; General ML Libraries; Cloud-native; “AutoML” platforms; GBDT Systems; DL Systems; LLM Systems]
6
Offline ML Deployment: Optimizations

Q: What systems-level optimizations are possible here?

❖ Data Parallelism:
❖ Inference is embarrassingly parallel across examples
❖ Factorized ML (e.g., in Morpheus):
❖ Push ML computations down through joins
❖ Pre-computes some FLOPS and reuses across examples
x_i = [x_{i,R} ; x_{i,U} ; x_{i,M}]

Example: GLM inference:
w^T x_i = w_R^T x_{i,R} + w_U^T x_{i,U} + w_M^T x_{i,M}
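To make the pre-computation concrete, here is a minimal NumPy sketch of the idea (illustrative only, not Morpheus's actual API or data layout; all table names and sizes are hypothetical): the dot product against the dimension-table features is computed once per dimension tuple and reused across every fact-table example that joins to it.

```python
import numpy as np

# Illustrative sketch of factorized GLM scoring over a join (not the Morpheus API).
# Fact table S holds per-example features plus a foreign key into dimension table R.
n, d_s, d_r, n_r = 100_000, 5, 20, 100          # hypothetical table sizes
X_s = np.random.rand(n, d_s)                    # fact-table features
X_r = np.random.rand(n_r, d_r)                  # dimension-table features
fk = np.random.randint(0, n_r, size=n)          # each example's foreign key into R
w_s, w_r = np.random.rand(d_s), np.random.rand(d_r)

# Naive: materialize the join, then score (recomputes w_r . x_r for every example).
scores_naive = X_s @ w_s + X_r[fk] @ w_r

# Factorized: pre-compute w_r . x_r once per dimension tuple; reuse it via the key.
partial_r = X_r @ w_r                           # n_r dot products instead of n
scores_factorized = X_s @ w_s + partial_r[fk]

assert np.allclose(scores_naive, scores_factorized)
```

The saving grows with the fan-out of the join (here n / n_r), since the redundant FLOPS over repeated dimension tuples are skipped.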


7
Offline ML Deployment: Optimizations

Q: What systems-level optimizations are possible here?


❖ More general pre-computation / caching / batching:
❖ Factorized ML is a specific form of sharing/caching
❖ Other forms of “multi-query optimization” possible

Example: Batched inference for separate GLMs:


Three separate products X_{n×d} w_1, X w_2, X w_3 (each w_j is d×1)
vs. one product X [w_1 ; w_2 ; w_3]_{d×3}
Reduces memory stalls for X; raises hardware efficiency
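A minimal NumPy sketch of this multi-query batching (sizes hypothetical): stacking the three weight vectors turns three matrix-vector passes over X into a single matrix-matrix product, so X is read from memory once.

```python
import numpy as np

# Sketch of multi-query batching: score the same X with 3 separately trained GLMs.
n, d = 100_000, 50                              # hypothetical sizes
X = np.random.rand(n, d)
w1, w2, w3 = np.random.rand(d), np.random.rand(d), np.random.rand(d)

# Separate passes: X is streamed through memory three times (three GEMVs).
y_separate = np.stack([X @ w1, X @ w2, X @ w3], axis=1)

# Batched: stack the weights into a d x 3 matrix and read X once (one GEMM).
W = np.column_stack([w1, w2, w3])
y_batched = X @ W

assert np.allclose(y_separate, y_batched)
```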


8
Peer Instruction Activity

(Switch slides)

9
Outline

❖ Offline ML Deployment
❖ MLOps:
❖ Online Prediction Serving
❖ Monitoring and Versioning
❖ Federated ML
❖ LLMOps

10
Background: DevOps

❖ Software Development + IT Operations (DevOps) is a long-standing subarea of software engineering
❖ No uniform definition but loosely, the science + engineering of administering software in production
❖ Fuses many historically separate job roles
❖ Cloud and “Agile” s/w eng. have revolutionized DevOps

11
Background: DevOps

12
https://medium.com/swlh/how-to-become-an-devops-engineer-in-2020-80b8740d5a52
Key Parts of DevOps Stack/Practice

❖ Logging & Monitoring
❖ Continuous Integration (CI) & Continuous Delivery (CD)
❖ Building & Testing
❖ Version Control
❖ Infrastructure-as-Code (IaC), including Config. & Policy
❖ Microservices / Containerization & Orchestration

Content Credit: Manasi Vartak, Verta.AI


https://aws.amazon.com/devops/what-is-devops/ 13
The Rise of MLOps

❖ MLOps = DevOps for ML-infused software


❖ Much harder than for deterministic software!
❖ Things that matter beyond just ML model codes:
❖ Training and validation datasets
❖ Data cleaning/prep/featurization codes/scripts
❖ Hyperparameters, other training configs
❖ Post-inference rules/configs/ensembling
❖ Software versions/configs?
❖ Training hardware/configs?

Content Credit: Manasi Vartak, Verta.AI 14


The Rise of MLOps

❖ Need to change DevOps for ML program semantics


❖ Online Prediction Serving
❖ Logging & Monitoring:
❖ Prediction failures; concept drift; feature inflow changes
❖ Version Control:
❖ Anything can change: ML code, data, configs, etc.
❖ Build & Test; CI & CD:
❖ Rigorous train-val-test splits; beware insidious overfitting
❖ New space with a lot of R&D; no consensus on standards

Content Credit: Manasi Vartak, Verta.AI 15


The “3 Vs of MLOps”

❖ Velocity:
❖ Need for rapid experimentation, prototyping, and deployment
with minimal friction
❖ Validation:
❖ Need for checks on quality and integrity of data, features,
models, predictions
❖ Versioning:
❖ Need to keep track of deployed models and features to
ensure provenance and fallback options

16
https://arxiv.org/pdf/2209.09125.pdf
The “3 Vs of MLOps”

❖ Interplay/tussles between the 3 Vs shape decisions on tools, processes, and people management in MLOps
❖ Examples:
❖ Should Jupyter notebooks be deployed to production?
Velocity vs. Validation
❖ Are feature stores needed? Velocity vs. Versioning
❖ Relabel/augment val. data? Validation vs. Versioning

17
https://arxiv.org/pdf/2209.09125.pdf
Birds-eye View of MLOps

18
https://arxiv.org/pdf/2205.02302.pdf
Birds-eye View of MLOps

19
https://arxiv.org/pdf/2205.02302.pdf
Outline

❖ Offline ML Deployment
❖ MLOps:
❖ Online Prediction Serving
❖ Monitoring and Versioning
❖ Federated ML
❖ LLMOps

20
Online Prediction Serving

❖ Standard setting for Web and IoT deployments of ML


❖ Typically needs to be real-time; < 100s of milliseconds!
❖ AKA model serving or ML serving
❖ Given: A trained prediction function f() + a stream of unlabeled
data example(s)
❖ Goal: Apply f() to all/each example efficiently
❖ Key metrics: Latency, memory footprint, cost, throughput

21
Online Prediction Serving

❖ Surprisingly challenging to do well in ML systems practice!


❖ Active area of R&D; many startups
❖ Key Challenges:
❖ Heterogeneity of environments: webpages, cloud-based
apps, mobile apps, vehicles, IoT, etc.
❖ Unpredictability of load: need to elastically upscale or
downscale resources
❖ Function’s complexity: model, featurization and data prep
code, output thresholds, etc.
❖ May straddle libraries and even PLs!
❖ Hard to optimize end to end in general
22
The Rise of Serverless Infra.

❖ Prediction serving is a “killer app” for Function-as-a-Service (FaaS), AKA serverless cloud infra.
❖ Extreme pay-as-you-go; can rent at millisecond level!

❖ Still, many open efficiency issues for ML deployment:


❖ Reduce memory footprints, input access restrictions,
logging / output persistence restrictions, latency
23
Online Prediction Serving: Systems

❖ Numerous serving systems have sprung up

General-purpose (supports multiple ML tools):

ML System-specific: TF Serving, TorchServe
24
Clipper

❖ A pioneering general-purpose ML serving system

25
Clipper: Principles and Techniques

❖ Generality and modularity:


❖ One of the first to use containers for prediction serving
❖ Supports multiple ML tools in unified layered API
❖ Efficiency:
❖ Some basic optimizations: batching to raise throughput (see the sketch below); caching of frequently accessed models/vectors
❖ Multi-model deployment and flexibility:
❖ A heuristic “model selection” layer to dynamically pick among
multiple deployed models; ensembling
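As a rough illustration of the batching idea (a sketch only, not Clipper's actual code; the request queue, predict_batch, and the reply callback are hypothetical names): requests are buffered briefly and flushed to the model as one batch, trading a small amount of latency for higher throughput.

```python
import time
from queue import Queue, Empty

def batching_loop(requests: Queue, predict_batch, max_batch=32, max_wait_s=0.005):
    """Sketch of dynamic batching for online serving. `requests` holds dicts like
    {"input": ..., "reply": callback}; both names are hypothetical."""
    while True:
        batch = [requests.get()]                         # block until a request arrives
        deadline = time.time() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.time()
            if remaining <= 0:
                break                                    # deadline hit: flush what we have
            try:
                batch.append(requests.get(timeout=remaining))
            except Empty:
                break
        outputs = predict_batch([r["input"] for r in batch])  # one batched inference call
        for req, out in zip(batch, outputs):
            req["reply"](out)                            # return each result to its caller
```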

26
TensorFlow Serving

❖ TF Serving is a mature ML serving system, also pioneering


❖ Optimized for TF model formats; also supports batching
❖ Dynamic reloading of weights; multiple data sources

❖ TF Lite and TF.JS optimized for more niche backends / runtime environments

27
vLLM: Overview
❖ Goal: Improve throughput of serving LLMs on GPUs
❖ Observation: Memory fragmentation due to dynamic memory
footprint of attention ops’ KV tensors wastes GPU memory
❖ Key Idea:
❖ Level of indirection akin to paging and virtual memory in OS
❖ Group attention ops’ KV tensors into a block per set of tokens
❖ Blocks need not be laid out contiguously in GPU memory
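A toy Python sketch of the block-table bookkeeping behind this key idea (illustrative only, not vLLM's implementation; all sizes are made up): each sequence maps its logical KV blocks to arbitrary physical blocks in a shared pool, so KV memory need not be reserved contiguously up front.

```python
import numpy as np

# Toy sketch of paged KV-cache bookkeeping (illustrative; not vLLM's implementation).
BLOCK_SIZE = 16                                  # tokens per KV block
NUM_BLOCKS = 1024                                # physical blocks in the shared GPU pool
HEAD_DIM = 64

kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, 2, HEAD_DIM), dtype=np.float16)
free_blocks = list(range(NUM_BLOCKS))            # simple allocator over physical blocks
block_tables = {}                                # seq_id -> list of physical block ids

def append_kv(seq_id, token_pos, k, v):
    """Write one token's K/V, allocating a new block on demand. Consecutive logical
    blocks of a sequence may land on scattered physical blocks."""
    table = block_tables.setdefault(seq_id, [])
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    if logical_block == len(table):              # sequence grew past its last block
        table.append(free_blocks.pop())          # grab any free physical block
    phys = table[logical_block]
    kv_pool[phys, offset, 0] = k
    kv_pool[phys, offset, 1] = v

def gather_kv(seq_id, seq_len):
    """Gather a sequence's K/V for attention by following its block table."""
    table = block_tables[seq_id]
    flat = kv_pool[table].reshape(-1, 2, HEAD_DIM)
    return flat[:seq_len]
```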

28
vLLM: Techniques and Impact

❖ (Switch to Hao’s slide deck)

29
Your Reviews on vLLM Paper

❖ (Walked through in class)

30
Comparing ML Serving Systems

❖ Benefits of general-purpose vs. ML system-specific:


❖ Tool heterogeneity is a reality for many orgs
❖ More nimble to customize accuracy post-deployment with
different kinds of models/tools
❖ Flexibility to swap ML tools; no “tool lock-in”
❖ Benefits of ML system-specific vs. general-purpose:
❖ Generality may not be needed inside org. (e.g., Google);
lower complexity of MLOps
❖ Likely more amenable to code/pipeline optimizations
❖ Likely better hardware utilization, lower cloud costs

31
Outline

❖ Offline ML Deployment
❖ MLOps:
❖ Online Prediction Serving
❖ Monitoring and Versioning
❖ Federated ML
❖ LLMOps

32
Example for ML Monitoring: TFX

❖ TFX’s “Model Analysis” lets users specify metrics, track them over time automatically, and alert on-call staff
❖ Can specify metrics for feature-based data “slices” too
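A generic sketch of slice-wise metric tracking (not the TFX/TFMA API; column names are hypothetical): compute the metric per feature-based slice so a regression on one subgroup is not hidden by the overall average.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_by_slice(df: pd.DataFrame, slice_col: str, label_col="label", score_col="score"):
    """Generic sketch of slice-wise monitoring: one metric row per data slice."""
    rows = []
    for slice_value, group in df.groupby(slice_col):
        if group[label_col].nunique() < 2:       # AUC is undefined on one-class slices
            continue
        rows.append({slice_col: slice_value,
                     "n": len(group),
                     "auc": roc_auc_score(group[label_col], group[score_col])})
    return pd.DataFrame(rows)

# Hypothetical usage: flag slices whose AUC falls below an alerting threshold.
# report = auc_by_slice(predictions_df, slice_col="country")
# bad_slices = report[report["auc"] < 0.7]
```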

33
https://www.tensorflow.org/tfx/guide/tfma
Example for ML Versioning: Verta

❖ Started with ModelDB for storing and tracking ML artifacts


❖ ML code; data; configuration; environment
❖ APIs as hooks into ML dev code; SDK and web app./GUI
❖ Registry for versions and workflows

34
https://blog.verta.ai/blog/the-third-wave-of-operationalization-is-here-mlops
Open Research Questions in MLOps

❖ Efficient and consistent version control for ML datasets and featurization codes
❖ Automate prediction failure detection and recovery
❖ Detect concept drift in an actionable manner; prescribe fixes
❖ Velocity and complexity of streaming ML applications
❖ CI & CD for model ensembles without insidious overfitting
❖ Automated end-to-end optimizations
❖ …

35
Outline

❖ Offline ML Deployment
❖ MLOps:
❖ Online Prediction Serving
❖ Monitoring and Versioning
❖ Federated ML
❖ LLMOps

36
Federated ML

❖ Pioneered by Google for ML/AI applications on smartphones


❖ Key benefit is more user privacy:
❖ User’s (labeled) data does not leave their device
❖ Decentralizes ML model training/finetuning to user data

https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
37
https://mlsys.org/Conferences/2019/doc/2019/193.pdf
Federated ML

❖ Key challenge: Decentralize SGD to intermittent updates


❖ They proposed a simple “federated averaging” algorithm

❖ User-partitioned updates break the IID assumption; skews arise


❖ Turns out SGD is still pretty robust (recall async. PS); open
theoretical questions still being studied
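A minimal sketch of the federated averaging idea for a toy least-squares model (illustrative only, not Google's implementation): each client takes local gradient steps on its own data, and the server averages the returned models weighted by each client's number of examples.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=1):
    """One client's local update for a toy least-squares model (runs on-device;
    the labeled data never leaves the client)."""
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def federated_averaging(w_global, client_data, rounds=10):
    """Sketch of federated averaging: clients train locally, the server averages
    the returned models weighted by local data size."""
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in client_data:                 # in practice, a sampled subset of devices
            updates.append(local_sgd(w_global.copy(), X, y))
            sizes.append(len(y))
        weights = np.array(sizes) / sum(sizes)
        w_global = sum(wt * upd for wt, upd in zip(weights, updates))
    return w_global
```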

38
https://arxiv.org/abs/1602.05629
Federated ML

❖ Privacy/security-focused improvements:
❖ New SGD variants; integration with differential privacy
❖ Cryptography to anonymize update aggregations
❖ Apart from strong user privacy, communication and energy
efficiency also major concerns on battery-powered devices
❖ Systems+ML heuristic optimizations:
❖ Compression and quantization to save upload bandwidth
❖ Communicate only high quality model updates
❖ Novel federation-aware ML algorithmics

https://arxiv.org/abs/1602.05629
https://arxiv.org/pdf/1610.02527.pdf
39
https://eprint.iacr.org/2017/281.pdf
Federated ML

❖ The federated ML protocol has become quite sophisticated to ensure better stability/reliability, accuracy, and manageability

40
https://mlsys.org/Conferences/2019/doc/2019/193.pdf
Federated ML

❖ Google has neatly abstracted the client-side (embedded in the mobile app.) and server-side functionality with an actor design

41
https://mlsys.org/Conferences/2019/doc/2019/193.pdf
Federated ML

❖ Notion of “FL Plan” and simulation-based tooling for data scientists to tailor ML for this deployment regime
❖ (Users’) Training data is out of reach!
❖ Model is updated asynchronously automatically
❖ Debugging and versioning become even more difficult

42
https://mlsys.org/Conferences/2019/doc/2019/193.pdf
Outline

❖ Offline ML Deployment
❖ MLOps:
❖ Online Prediction Serving
❖ Monitoring and Versioning
❖ Federated ML
❖ LLMOps

43
LLMOps: Birds-Eye View

44
https://www.databricks.com/glossary/llmops
LLMOps: Emerging Stack

45
https://a16z.com/emerging-architectures-for-llm-applications/
LLMOps: Principles and Practices

❖ Three main groups of new technical concerns in LLMOps:


❖ Managing Data Ingestion
❖ Chunking of text, embedding creation/indexing/maintenance,
Retrieval-Augmented Generation (RAG)
❖ Managing LLM API Usage
❖ Prompt engineering / management, application abstractions,
caching, logging, validation/“guardrails”
❖ Customizing the LLM
❖ Finetuning, transfer learning, routing layers

46
LLMOps: Managing Data Ingestion

❖ LLM applications need to handle large multimodal corpora: text, PDFs, JSON, images, etc.
❖ 3 key sub-parts of handling such data:
❖ Chunking: Partition docs, pages, etc. into bite-sized pieces
❖ Embedding: Generate embeddings for chunks (using LLMs)
❖ Vector DBMS: Store; index; retrieve for queries; maintain
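A minimal end-to-end sketch of these three sub-parts (illustrative only; embed() is a stand-in for a real embedding-model call, and the brute-force store stands in for a real vector DBMS):

```python
import numpy as np

def chunk(text, size=500, overlap=50):
    """Chunking: split a document into overlapping, bite-sized pieces."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts):
    """Embedding: stand-in for an embedding-model/LLM call (random vectors here)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384)).astype(np.float32)

class TinyVectorStore:
    """Vector DBMS stand-in: store chunk vectors; retrieve nearest chunks by cosine."""
    def __init__(self):
        self.vecs, self.chunks = [], []
    def add(self, vectors, chunks):
        self.vecs.extend(vectors)
        self.chunks.extend(chunks)
    def query(self, qvec, k=3):
        M = np.stack(self.vecs)
        sims = (M @ qvec) / (np.linalg.norm(M, axis=1) * np.linalg.norm(qvec) + 1e-9)
        return [self.chunks[i] for i in np.argsort(-sims)[:k]]

# Ingest: chunk -> embed -> store; at query time, retrieve top-k chunks (e.g., for RAG).
store = TinyVectorStore()
pieces = chunk("some long document text " * 200)
store.add(embed(pieces), pieces)
top_chunks = store.query(embed(["what does the document say about X?"])[0], k=3)
```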

47
LLMOps: Managing Data Ingestion

48
https://www.linkedin.com/pulse/3-ways-vector-databases-take-your-llm-use-cases-next-level-mishra
LLMOps: Prompt Engineering

Q: What is a “prompt”?

❖ A prompt is an input to an LLM API with some of these elements:
❖ Instruction: A specific task or
instruction for the model to do
❖ Context: External information or
additional context that can steer the
model to better responses
❖ Input Data: The input or question we
need a response for
❖ Output Indicator: The type or format
of the output
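A small, hypothetical example that puts the four elements together in one prompt string:

```python
# Hypothetical prompt illustrating the four elements above.
prompt = """Instruction: Classify the sentiment of the review as positive, negative, or neutral.

Context: Reviews come from a consumer electronics store; sarcasm is common.

Input: "The battery died after two days. Fantastic."

Output format: Reply with exactly one word: positive, negative, or neutral."""
```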
49
https://www.promptingguide.ai/introduction/elements
LLMOps: Prompt Engineering

Q: What is “prompt engineering”?

50
LLMOps: Prompt Engineering

Q: What is “prompt engineering”?

❖ A set of “best practices” (witchcraft?) and “guidelines” (spellbook?) to craft prompts (spells?) for more effective use of LLMs
❖ Tricks and techniques abound, e.g., “chain of thought”, “self-consistency”, “few-shot”, etc.; take a generative NLP course
❖ Some useful practical references:

https://www.promptingguide.ai/techniques
https://platform.openai.com/docs/guides/prompt-engineering/six-strategies-for-getting-better-results
https://aws.amazon.com/what-is/prompt-engineering/

51
Emerging LLMOps-Specific Tools

❖ Prompt management and data embedding management are increasingly critical for LLM applications
❖ Tools for LLM appl. dev are in flux; 2 recent popular examples:
❖ LangChain:
❖ Enables more structured compositions of LLM API invocations
called “chains” (akin to “transactions” in RDBMS applications)
❖ Stitch together more holistic “agents” for an application
❖ LlamaIndex:
❖ Mainly an aid for Retrieval-Augmented Generation and
similarity search
❖ APIs to parse, chunk, store, index, and query pieces
52
53
RAG and LlamaIndex

54
Regular MLOps vs. LLMOps

❖ Center of the universe? In-house trained model artifact vs. API-based access to a pre-trained LLM
❖ Query input? Feature vector vs. rich customizable prompts
❖ Typical prediction targets? Discriminative/inferential targets vs.
generative content
❖ Tuning? Model selection aspects (hyper-par. tuning, arch.
tuning, feature eng., etc.) vs. in-context learning / prompt tuning
❖ Auxiliary data stores? Feature stores/ML platforms vs. Vector
DBMSs
❖ Cost/latency optimizations? In-house model/pipeline/resource
optimizations vs. Fixed LLM operating points or re-routing

55
Review Questions
1. Briefly explain 2 reasons why online prediction serving is typically
more challenging in practice than offline deployment.
2. Briefly describe 1 systems optimization performed by both Clipper
and vLLM for prediction serving.
3. Briefly discuss 1 systems-level optimization amenable to both
offline ML deployment and online prediction serving.
4. Name 3 things that must be versioned for rigorous version control
in MLOps.
5. Briefly explain 2 reasons why ML monitoring is needed.
6. Briefly explain 2 reasons why federated ML is more challenging for
data scientists to reason about.
7. Briefly explain 1 way LLMOps deviates from regular MLOps.
8. Briefly explain 1 new technical concern in LLMOps that required
new tool development. 56
