Introduction Project Peach Ext

Project Peach is an initiative by VMware aimed at creating a comprehensive AI/ML stack on vSphere, integrating various machine learning services and frameworks. It focuses on enhancing the deployment and management of AI workloads through improved infrastructure and automation, addressing the needs of data scientists and ML engineers. The project includes partnerships with MLOps ISVs and aims to streamline the lifecycle of machine learning applications.


Project Peach

Confidential │ © VMware, Inc.


Advanced analytics · Next-generation storefronts · Self-service experiences

More applications and solutions will be deployed in the next five years
than in the last 40 years.

Data-Defined Business Processes · Industrial IoT · Business process automation


Machine Learning System Pyramid

AI Services (a.k.a. Cloud AI Developer Services, per Gartner)
Cloud-hosted or containerized services via API/SDK
• Sight: vision, video intelligence
• Language: understanding, translation
• Structured data: tables, AutoML, …
Leading vendors (Gartner):
• ISV: H2O.ai, Aible, Salesforce, Prevision.io, …
• CSP: AWS, GCP, Azure, Alibaba, Baidu, Tencent

ML Services (a.k.a. Data Science and Machine Learning Platforms, per Gartner)
• MLOps
• Monitoring: data quality and lineage; model quality and lineage

ML Frameworks and Infrastructure
• Deep Learning VM images / containers
• ML frameworks: TensorFlow, PyTorch, scikit-learn, XGBoost, PaddlePaddle, …

Hardware (resource sharing / virtualization / containerization)
• Nvidia GPU: vGPU/MIG, Nvidia Container
• VMware Bitfusion, VirtAITech OrionX, Project Thunder


AWS AI/ML Landscape

Prepare ▶ Build ▶ Train & tune ▶ Deploy & manage

AI Services
• Amazon Comprehend / Lex / Polly / Rekognition / Transcribe / Translate / …

ML Services (SageMaker)
SageMaker Studio / Console / SDK
• Prepare: SageMaker Ground Truth, Data Wrangler, Feature Store, Processing
• Build: SageMaker Autopilot, Notebooks, AWS Marketplace, JumpStart
• Train & tune: SageMaker Experiments, Debugger, Automatic Model Tuning
• Deploy & manage: SageMaker Model Monitor, Edge Manager, Elastic Inference, SageMaker Neo

ML Framework & Infrastructure
• Deep Learning AMIs & Containers
• Elastic Inference, Nvidia GPU EC2 (P3, P3dn, …), Inferentia EC2 (Inf1) / Trainium


Nvidia AI/ML Landscape

Prepare ▶ Build ▶ Train & tune ▶ Deploy & manage

AI Services
• TAO Toolkit, Catalog

ML Services
• Nvidia: RAPIDS, Base Command Platform, Triton Inference Server, TensorRT, Catalog
• Partnerships: Weights & Biases, ClearML, Core Scientific, Determined AI, Domino, Iguazio, Paperspace

ML Framework & Infrastructure
• CUDA, GPU Operator, Nvidia AI Enterprise
• Platforms: OpenShift et al., vSphere; workstation, DGX, cloud
• Nvidia GPU (vGPU/MIG)


E2E Workflow to Deploy AI/ML Workloads on vSphere

Step 1: vSphere deployment (not specific to AI/ML)
Step 2: Set up the DLA; options to share/manage the DLA: Nvidia GPU (vGPU/MIG/PT), Bitfusion 4.x, Radium (planned), vAIA (planned)
Step 3: Set up the VM / K8s cluster (not specific to AI/ML): TKG K8s clusters (or other K8s options), virtual machines
Step 4: Set up the DLA client / environment: 3rd-party repos (e.g. NGC), MLOps platforms (e.g. Kubeflow, Domino), ML & model images
Step 5: AI/ML workload: Pods with DLA resources, VMs with DLA resources


Modern ML/AI Infrastructure Is Needed
Following in the footsteps of the compute, storage, and networking evolution

• Agile: rapid response to data scientists and ML engineers for AI acceleration and co-processors, anytime and anywhere
• Efficient: increased utilization and efficiency of high-cost AI servers, and reduced operational costs
• Ease of integration: deployable with no disruption to the current infrastructure; seamless integration into workflows and lifecycle
• Scalable: extensible to support any cloud, edge, and carriers


What Problem Are We Trying to Solve?
Create the blueprint for the vSphere AI/ML stack

Goals
Fill the gaps from an end-to-end perspective:
• Data and model pipelines and their lifecycle
• ML stack and ecosystem
• Lifecycle from the customers' view

Fit the AI/ML workload pattern:
• Open source
• Cloud native

Non-Goals
• How to pass through or virtualize DLA devices


Project Peach Architecture

Model / Application Interface
• Integration with Project Taiga, KubeFATE, AIverse, …
• LLM models

vSphere Machine Learning Extension (Kubeflow on vSphere)
• (MVP) Kubeflow on vSphere packaging and releases that fully leverage the vSphere Interface
• Ray, support for VMs, and more
• Deliverables: OSS, Flings, showcase

MLOps ISVs
• (MVP) Onboard MLOps ISVs to leverage the vSphere Interface
• Develop a certification/validation program and onboard more ISVs

vSphere Interface (autoscaler, monitoring, …)
• (MVP) Autoscaler for GPU scheduling
• (MVP) GPU monitoring and quota
• (MVP) Streamline the lifecycle
• Support for vSphere VMs and other K8s
• Deliverables: vSphere enhancements and docs

Legend: OSS | Proprietary


Align with the VMware Generative AI / LLM Solution Stack
BYOAI: a flexible stack allowing customers to run their AI tools/frameworks of choice on vSphere and VCF

Customer Applications
🍑 Project Peach: Model / Application Interface
🍑 Project Peach: Kubeflow on vSphere, MLOps ISVs (alongside Data Services Manager)
vSphere Interface (autoscaler, monitoring, …)
Virtual Machines | Containers
vSphere 8 (with Tanzu optional): Compute, Storage, Network, GPU

Legend: OSS | Proprietary


Project Peach: Full-Stack ML Platform Enablement
The vSphere with Tanzu instance

Stack (top to bottom): ML Platform (Kubeflow, cnvrg.io, Domino, …) with data processing, model training, model serving, and IAM; TKG Cluster (ML Pods, GPU Operator, GPU) on the TKG Service; GPU-enabled OS image¹ and GPU-enabled image (libraries, et al.)²; IaaS Service (IAM, VM Service, MinIO, accelerator drivers); ESXi (vGPU, Assignable Hardware, DVX) on GPU, CPU, NIC, and NVMe.

Kubeflow release for vSphere
• Fully functional OSS Kubeflow release for vSphere and its packaging
• Use the TKGS autoscaler for advanced GPU management

Advanced GPU Management Service
• GPU SDKs / services
• Verify with Kubeflow and 2+ ISVs

VMware-hosted ML images² for the VM Service to support GPUs
• Optimized ML frameworks and libraries for Radium, Nvidia, Intel oneAPI, et al.
• ML models and pipelines
• Physical and virtual device deployer

OS images¹
• Include GPU drivers to support accelerators
• Deliver accelerator add-ons like the GPU Operator

Accelerator enablement
• Vendor-specific accelerator enablement, e.g. Nvidia vGPU
• Hardware topologies from Assignable Hardware
• S-IOV DVX support

IaaS service
• (IAM) Connect ML platforms to Pinniped
• (MinIO) Connect object storage to ML platforms


Kubeflow on vSphere

Prepare ▶ Build ▶ Train & Tune ▶ Deploy & Manage

Kubeflow Dashboard / SDK / …

ML Services
• Prepare: FEAST, Spark, MinIO
• Build: JupyterLab, Fairing, ML Framework Operators (TensorFlow / Paddle / …)
• Train & tune: Katib, Pipelines, Metadata
• Deploy & manage: KFServing, Knative
• Platform services: Istio / Argo / Prometheus / Dex

Kubernetes Operators / Kubernetes Device Plugins

Infrastructure
• TKG, Nvidia AI Enterprise, multi-cloud, local cloud, VMC, public clouds, edge / other K8s
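Katib, listed under Train & tune above, automates hyperparameter search. The idea can be sketched in a few lines of plain Python; this is a stdlib-only illustration of random search, not Katib's actual API, and the objective function is an invented stand-in for a real training run.

```python
import random

def objective(lr: float) -> float:
    # Hypothetical stand-in for a training run: pretend validation
    # accuracy peaks near lr = 0.01 and falls off on either side.
    return 1.0 - abs(lr - 0.01) * 10

def random_search(trials: int = 20, seed: int = 0) -> dict:
    # Katib-style random search: sample hyperparameters from the
    # search space, run a trial for each sample, keep the best result.
    rng = random.Random(seed)
    best = {"lr": None, "score": float("-inf")}
    for _ in range(trials):
        lr = 10 ** rng.uniform(-4, -1)   # log-uniform over [1e-4, 1e-1]
        score = objective(lr)
        if score > best["score"]:
            best = {"lr": lr, "score": score}
    return best

best = random_search()
```

In Katib the trials run as Kubernetes jobs in parallel rather than in a loop, but the propose/evaluate/keep-best structure is the same.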


Kubeflow on vSphere
01–04: Kubeflow on VMware Cloud · ML Images · DLA Operator · APIs · GPUaaS

Data Engineering Team (Data Pipeline): data ingestion, data preprocessing, feature engineering, data curation
ML Engineering Team (ML Pipeline): model development, model training (Operators, Katib, Kubeflow Manifests)
Apps Engineering Team (App Pipeline): model serving

MLOps Platform
• UI / orchestration: dashboard / notebook / pipeline / monitoring (Kubeflow Dashboard)
• Storage: artifacts / metadata / storage (Kubeflow ML Metadata)
• GPU: Bitfusion / Nvidia MIG

Infra (IT Team): vSphere / Kubernetes / Istio / Dex
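The three pipelines above hand artifacts from one team's stage to the next. A toy stdlib-only sketch of that chaining follows; every stage body is a placeholder (the data, the least-squares "training", and the function names are all invented for illustration).

```python
def ingest() -> list:
    # Data pipeline: pull raw records (hypothetical data where y = 2x).
    return [{"x": i, "y": 2 * i} for i in range(10)]

def preprocess(rows: list) -> list:
    # Data pipeline: drop records that fail a basic quality check.
    return [r for r in rows if r["x"] >= 0 and r["y"] >= 0]

def train(rows: list) -> dict:
    # ML pipeline: fit y = w * x by least squares, a stand-in for
    # real model training on a GPU-backed pod.
    num = sum(r["x"] * r["y"] for r in rows)
    den = sum(r["x"] ** 2 for r in rows) or 1
    return {"w": num / den}

def serve(model: dict, x: float) -> float:
    # App pipeline: answer a prediction request with the trained model.
    return model["w"] * x

# The platform's job is to run this chain reliably across three teams.
model = train(preprocess(ingest()))
```

In Kubeflow each function would be a pipeline component with its inputs and outputs tracked in ML Metadata, rather than a direct function call.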


Kubeflow on vSphere – OSS Update
Official Kubeflow distribution

https://www.kubeflow.org/docs/started/installing-kubeflow/
https://vmware.github.io/vSphere-machine-learning-extension/


MLOps Level 1: ML Pipeline Automation (per GCP)

The goal of level 1 is to perform continuous training (CT) of the model by automating the ML pipeline; this lets you achieve continuous delivery of the model prediction service. You need to introduce automated data and model validation steps into the pipeline, as well as pipeline triggers and metadata management.

Characteristics of MLOps level 1: rapid experimentation, CT of the model in production, experimental-operational symmetry, modularized code for components and pipelines, continuous delivery of models, and pipeline deployment.
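The data and model validation steps called out above act as gates in the automated pipeline. A stdlib-only sketch of that control flow (the thresholds, record shape, and function names are invented for illustration):

```python
def validate_data(rows: list, min_rows: int = 5) -> bool:
    # Data validation gate: refuse to train on too little or malformed data.
    return len(rows) >= min_rows and all("x" in r and "y" in r for r in rows)

def validate_model(metric: float, threshold: float = 0.9) -> bool:
    # Model validation gate: only promote models that clear the threshold.
    return metric >= threshold

def ct_pipeline(rows, train_fn, eval_fn) -> dict:
    # One continuous-training run: gate on data, train,
    # gate on model quality, then deploy.
    if not validate_data(rows):
        return {"status": "rejected", "reason": "data validation failed"}
    model = train_fn(rows)
    metric = eval_fn(model, rows)
    if not validate_model(metric):
        return {"status": "rejected", "reason": "model validation failed"}
    return {"status": "deployed", "metric": metric}
```

A pipeline trigger (new data, a schedule, or a code change) would invoke `ct_pipeline`, and metadata management would record each run's verdict and metric.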


Vision: Putting It All Together – Infrastructure Side
Two-layered GPU management and scheduling

Workloads: NFV, VDI, render farm, HPC, AI/ML, plus emerging workloads (GPU ETL, AI/ML, HPC, DB acceleration)

Layer 1 (per workload): Pods, VMs, and bare metal bound to physical/virtual device(s)
Layer 2 (per vSphere cluster): GPU autoscaling, gang scheduling, SW virtualization integration, GPU monitoring, GPU quota, DRS

This management spans multiple vSphere clusters.
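The GPU quota layer above admits or rejects workloads per namespace before they ever reach a scheduler. A minimal stdlib-only sketch of that book-keeping (the class and method names are illustrative, not a vSphere or Kubernetes API):

```python
class GpuQuota:
    # Per-namespace GPU quota tracker: the book-keeping behind
    # a "GPU Quota" layer in a two-layered management scheme.
    def __init__(self, limits: dict):
        self.limits = dict(limits)            # namespace -> max GPUs
        self.used = {ns: 0 for ns in limits}  # namespace -> GPUs in use

    def request(self, namespace: str, gpus: int) -> bool:
        # Admit the workload only if it fits under the namespace quota.
        if self.used.get(namespace, 0) + gpus > self.limits.get(namespace, 0):
            return False
        self.used[namespace] = self.used.get(namespace, 0) + gpus
        return True

    def release(self, namespace: str, gpus: int) -> None:
        # Return GPUs to the pool when the workload finishes.
        self.used[namespace] = max(0, self.used.get(namespace, 0) - gpus)
```

A real implementation would also account for fractional GPUs (vGPU profiles, MIG slices) rather than whole devices.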
Pain Points and Expected UX

Today: complex workflow, manual process, error-prone / low efficiency, ~hours to days
• Step 1 (Data Scientist): the current configuration does not meet the requirement; the DS issues a request to the team to re-configure.
• Step 2 (VI Admin): set up the vmClass and vmClassBinding per the DS's request, and even recreate the MIG instance in the MIG case.
• Step 3 (DevOps): configure TKC resources with the newly created GPU nodes.
• Step 4 (Data Scientist): schedule ML workloads onto the acquired resources.

Expected: simple workflow, fully automated, high efficiency, ~30 mins
• The DS schedules ML workloads directly with the resource requirement; the GPU Scheduling Service provisions the rest, and the DS completes the ML job.
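In the expected UX, a GPU Scheduling Service maps the data scientist's resource requirement to a concrete GPU slice instead of a manual vmClass/MIG reconfiguration. A stdlib-only sketch of that matching step; the profile table mirrors common Nvidia A100 MIG profile names, but the table and function are illustrative, not the product's logic.

```python
# Illustrative MIG profiles: profile name -> GPU memory in GiB.
MIG_PROFILES = {"1g.5gb": 5, "2g.10gb": 10, "3g.20gb": 20, "7g.40gb": 40}

def pick_profile(mem_gib: int):
    # Pick the smallest MIG profile satisfying the requested memory,
    # so the DS states a requirement, never a device configuration.
    fitting = [(mem, name) for name, mem in MIG_PROFILES.items()
               if mem >= mem_gib]
    return min(fitting)[1] if fitting else None
```

Returning `None` corresponds to the request the service cannot satisfy, which today becomes the manual re-configuration loop in steps 1 through 3.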


vSphere Interface for MLOps
Based on the to-be-released Autoscaler (Q2 FY24)

TKGS Cluster: Autoscaler, control and worker VMs, Pods and VMs across namespaces

Supervisor Cluster (IaaS Platform)
• NCP: NSX Container Plugin
• CSI: Container Storage Interface
• TKGS: Tanzu Kubernetes Grid Service
• CAPW: Cluster API Provider for WCP
• VM Operator, VM Service, Net Operator

vSphere SDDC: NSX, CNS, VPXD, WCPSVC, Content Library
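The autoscaler's core decision is how many GPU worker nodes the TKGS cluster needs for the pods it cannot yet place. A stdlib-only sketch of that sizing rule; the names and the simplification of ignoring free GPUs on existing workers are mine, not the product's algorithm.

```python
import math

def desired_workers(pending_gpu_pods: int, gpus_per_pod: int,
                    gpus_per_worker: int, current_workers: int,
                    max_workers: int) -> int:
    # Scale out by enough workers to fit all pending GPU pods, capped
    # at the cluster maximum. (Simplification: assumes no free GPUs
    # remain on the current workers.)
    needed_gpus = pending_gpu_pods * gpus_per_pod
    extra = math.ceil(needed_gpus / gpus_per_worker)
    return min(max_workers, current_workers + extra)
```

A production autoscaler also handles scale-in, bin-packing across heterogeneous vmClasses, and MIG-sliced capacity, which this sketch omits.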


ML Containers, Models and IDE
01 Kubeflow on vSphere · 02 ML Images & IDE · 03 Lifecycle & Pipeline · 04 Device Plugin & Telemetry

• Support higher-level ML and AI services on vSphere: ML and AI services from 3rd parties, pre-trained models
• Support mainstream ML frameworks
• Support both VMs and containers: TKG containers, ESXi VMs
• Support any DLA via passthrough/virtualization: PT/vGPU/MIG/…, Bitfusion, Radium, vAIA

• Get the best performance on vSphere; no tuning required
• Customizable container images
• GitHub open-source project coming soon
• 300+ built-in pre-trained tabular, text and vision models
• 150+ fine-tunable models
• Inference and training SDKs


Thank You
