Introduction Project Peach Ext
ML Services (a.k.a. Data Science and Machine Learning Platforms as per Gartner)
MLOps Monitoring: Data Quality and Lineage; Monitoring: Model Quality and Lineage
ML Frameworks: TensorFlow, XGBoost, Sklearn, PyTorch, PaddlePaddle, …
AI Service Catalog: Amazon Comprehend/Lex/Polly/Rekognition/Transcribe/Translate/…, TAO Toolkit
Partnership Catalog: Weights & Biases, Determined AI, Domino, iguazio, Paperspace
TensorRT
ML framework & Infrastructure: CUDA, GPU Operator
Nvidia Enterprise AI
OpenShift, et al; vSphere
Nvidia GPU (vGPU/MIG/PT)
Goals
Fill the gaps from an end-to-end perspective
• Data and model pipeline and lifecycle
• ML stack and ecosystem
• Lifecycle from customers’ view
Non-Goals
How to pass through or virtualize DLA devices
ISV
OS
vSphere Machine Learning Extension (Kubeflow on vSphere)
• (MVP) Kubeflow on vSphere packaging and releases to fully leverage vSphere Interface
• Ray, Support for VMs, and more …
• Deliverables: OSS, Flings, Showcase
MLOps ISVs
• (MVP) Onboard MLOps ISVs to leverage vSphere Interface
• Develop a cert/validation program and onboard more ISVs
• Support for vSphere VMs and other K8S
• Deliverables: vSphere enhancement and Docs
Legend: OSS, Proprietary
Customer Applications
🍑 Project Peach
Kubeflow on vSphere
MLOps ISVs
Data Services Manager
ML Service
MinIO JupyterLab Fairing Knative
Infrastructure
Multi-Cloud TKG, Nvidia AI Enterprise, Local Cloud, VMC-Public Clouds, Edge / Other K8s
Operators
Katib
Kubeflow Manifest
MLOps Platform
Kubeflow Dashboard
Kubeflow ML Metadata
IT Team
vSphere / Kubernetes / Istio / Dex
https://www.kubeflow.org/docs/started/installing-kubeflow/
https://vmware.github.io/vSphere-machine-learning-extension/
Workloads: NFV, VDI, HPC, AI/ML, Render Farm, GPU DB, ETL Acc
VM, VM, VM, Bare-metal (each with Phy/Virt Device(s))
Pod, Pod, Pod
GPU Autoscaling
Gang Scheduling
SW Virtualization Integration
GPU Monitoring
GPU Quota
DRS
vSphere Cluster
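To make the "Gang Scheduling" and "GPU Quota" labels above concrete, here is a minimal sketch of all-or-nothing placement: a job's pods are started only if the whole gang fits within the team's GPU quota and the free GPUs on the workers. Everything in it is hypothetical and for illustration only (the `Node` class, the `gang_schedule` function, the first-fit-decreasing heuristic); it is not how vSphere, DRS, or any Kubernetes scheduler implements these features.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical worker (VM or bare-metal host) with free GPU devices."""
    name: str
    free_gpus: int


def gang_schedule(pod_gpu_requests, nodes, quota_remaining):
    """Place every pod of a job, or none of them (all-or-nothing).

    pod_gpu_requests: GPUs requested by each pod in the gang.
    nodes: list of Node; mutated only if the whole gang fits.
    quota_remaining: GPUs the requesting team may still consume.
    Returns one node name per pod (in input order), or None.
    """
    # GPU quota check: reject the whole job if it would exceed the quota.
    if sum(pod_gpu_requests) > quota_remaining:
        return None

    # Work on a copy so a failed attempt leaves the cluster state untouched.
    free = {n.name: n.free_gpus for n in nodes}
    placement = [None] * len(pod_gpu_requests)

    # First-fit decreasing (a heuristic): place the largest requests first.
    order = sorted(range(len(pod_gpu_requests)),
                   key=lambda i: pod_gpu_requests[i], reverse=True)
    for i in order:
        need = pod_gpu_requests[i]
        target = next((name for name, gpus in free.items() if gpus >= need), None)
        if target is None:
            return None  # one pod cannot be placed, so the gang is not started
        free[target] -= need
        placement[i] = target

    # Commit only after every pod has a home.
    for n in nodes:
        n.free_gpus = free[n.name]
    return placement
```

Usage: with `nodes = [Node("vm-a", 2), Node("vm-b", 1)]`, calling `gang_schedule([1, 1, 1], nodes, 4)` places all three pods, while `gang_schedule([2, 2], nodes, 8)` returns `None` and leaves the nodes unchanged, because the second pod has nowhere to run. Because first-fit decreasing is a heuristic, it can occasionally reject a gang that a more exhaustive search would place.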
Confidential │ © VMware, Inc. 16
Pain Points and Expected UX
STEP 4
TKGS Cluster
Autoscaler Pod
Pod Pod VM Pod VM VM
Worker VM
Control
Supervisor Cluster
IaaS Platform
NCP: NSX Container Plugin
CSI: Container Storage Interface
TKGS: Tanzu Kubernetes Grid Service
CAPW: Cluster API Provider for WCP
VM Operator, VM Service, Net Operator
vSphere SDDC
Support mainstream ML frameworks
• Get best performance on vSphere; no tuning required
• Customizable container images
• GitHub open-source project coming soon
• 300+ built-in pretrained tabular, text and vision models
• 150+ fine-tunable models
• Inference and training SDKs