Introduction Project Peach Ext

Project Peach is an initiative by VMware aimed at creating a comprehensive AI/ML stack on vSphere, integrating various machine learning services and frameworks. It focuses on enhancing the deployment and management of AI workloads through improved infrastructure and automation, addressing the needs of data scientists and ML engineers. The project includes partnerships with MLOps ISVs and aims to streamline the lifecycle of machine learning applications.


Project Peach

Confidential │ © VMware, Inc.


Advanced analytics · Next-generation storefronts · Self-service experiences

More applications and solutions will be deployed in the next five years
than in the last 40 years.

Data-Defined Business Processes · Industrial IoT · Business process automation


Machine Learning System Pyramid

AI Services (a.k.a. Cloud AI Developer Services, per Gartner)
Cloud-hosted or containerized services via API/SDK
• Sight: vision, video intelligence
• Language: understanding, translation
• Structured data: tables, AutoML, …
Leading vendors (Gartner):
• ISV: H2O.ai, Aible, Salesforce, Prevision.io, …
• CSP: AWS, GCP, Azure, Alibaba, Baidu, Tencent

ML Services (a.k.a. Data Science and Machine Learning Platforms, per Gartner)
• MLOps
• Monitoring: data quality and lineage; model quality and lineage

ML Frameworks and Infrastructure
• Deep Learning VM images / containers
• ML frameworks: TensorFlow, PyTorch, scikit-learn, XGBoost, PaddlePaddle, …

Hardware (resource sharing / virtualization / containerization)
• Nvidia GPU: vGPU/MIG, Nvidia Container
• VMware Bitfusion, VirtAITech OrionX, Project Thunder


AWS AI/ML Landscape

Prepare ▶ Build ▶ Train & tune ▶ Deploy & manage

AI Services
• Amazon Comprehend / Lex / Polly / Rekognition / Transcribe / Translate / …

ML Services (SageMaker)
SageMaker Studio / Console / SDK
• Prepare: SageMaker Ground Truth, Data Wrangler, Feature Store, Processing
• Build: SageMaker Autopilot, Notebooks, AWS Marketplace, JumpStart
• Train & tune: SageMaker Experiments, Debugger, Automatic Model Tuning
• Deploy & manage: SageMaker Model Monitor, Edge Manager, Elastic Inference, SageMaker Neo

ML Framework & Infrastructure
• Deep Learning AMIs & Containers
• Elastic Inference, Nvidia GPU EC2 (P3, P3dn, …), Inferentia EC2 (Inf1) / Trainium


Nvidia AI/ML Landscape

Prepare ▶ Build ▶ Train & tune ▶ Deploy & manage

AI Services
• TAO Toolkit, Catalog

ML Services
• Nvidia: RAPIDS, Base Command Platform, Triton Inference Server, TensorRT, Catalog
• Partnerships: Weights & Biases, ClearML, Core Scientific, Determined AI, Domino, Iguazio, Paperspace

ML Framework & Infrastructure
• CUDA, GPU Operator, Nvidia AI Enterprise
• Platforms: OpenShift et al., vSphere; workstation, DGX, cloud
• Nvidia GPU (vGPU/MIG)


E2E Workflow to Deploy AI/ML Workloads on vSphere

Step 1: vSphere deployment (not specific to AI/ML)
Step 2: Set up the DLA; options to share/manage the DLA: Nvidia GPU (vGPU/MIG/PT), Bitfusion 4.x, Radium (planned), vAIA (planned)
Step 3: Set up the VM / K8s cluster (not specific to AI/ML): TKG K8s clusters (or other K8s options), virtual machines
Step 4: Set up the DLA client / environment: 3rd-party repos (e.g. NGC), MLOps platforms (e.g. Kubeflow, Domino), ML & model images
Step 5: AI/ML workload: Pods with DLA resources, VMs with DLA resources


Modern ML/AI Infrastructure Is Needed
Following in the footsteps of the compute, storage, and networking evolution

• Agile: rapid response to data scientists and ML engineers for AI acceleration and co-processors, anytime and anywhere
• Efficient: increased utilization and efficiency of high-cost AI servers, and reduced operational costs
• Ease of integration: deployable with no disruption to the current infrastructure; seamless integration into workflows and lifecycle
• Scalable: extensible to support any cloud, edge, and carriers


What Problem Are We Trying to Solve?
Create the blueprint for the vSphere AI/ML stack

Goals
Fill the gaps from an end-to-end perspective:
• Data and model pipelines and their lifecycle
• ML stack and ecosystem
• Lifecycle from the customers' view

Fit the AI/ML workload pattern:
• Open source
• Cloud native

Non-Goals
• How to pass through or virtualize DLA devices


Project Peach Architecture

Model / Application Interface
• Integration with Project Taiga, KubeFATE, AIverse, …
• LLM models

vSphere Machine Learning Extension (Kubeflow on vSphere)
• (MVP) Kubeflow on vSphere packaging and releases that fully leverage the vSphere Interface
• Ray, support for VMs, and more
• Deliverables: OSS, Flings, showcase

MLOps ISVs
• (MVP) Onboard MLOps ISVs to leverage the vSphere Interface
• Develop a certification/validation program and onboard more ISVs

vSphere Interface (autoscaler, monitoring, …)
• (MVP) Autoscaler for GPU scheduling
• (MVP) GPU monitoring and quota
• (MVP) Streamline the lifecycle
• Support for vSphere VMs and other K8s
• Deliverables: vSphere enhancements and docs

Legend: OSS | Proprietary


Align with the VMware Generative AI / LLM Solution Stack
BYOAI: a flexible stack allowing customers to run their AI tools/frameworks of choice on vSphere and VCF

Customer Applications
🍑 Project Peach: Model / Application Interface
🍑 Project Peach: Kubeflow on vSphere, MLOps ISVs (alongside Data Services Manager)
vSphere Interface (autoscaler, monitoring, …)
Virtual Machines | Containers
vSphere 8 (with Tanzu optional): Compute, Storage, Network, GPU

Legend: OSS | Proprietary


Project Peach: Full-Stack ML Platform Enablement
The vSphere with Tanzu instance

Stack (top to bottom): ML Platform (Kubeflow, cnvrg.io, Domino, …) with data processing, model training, model serving, and IAM; TKG Cluster (ML Pods, GPU Operator, GPU) on the TKG Service; GPU-enabled OS image¹ and GPU-enabled image (libraries, et al.)²; IaaS Service (IAM, VM Service, MinIO, accelerator drivers); ESXi (vGPU, Assignable Hardware, DVX) on GPU, CPU, NIC, and NVMe.

Kubeflow release for vSphere
• Fully functional OSS Kubeflow release for vSphere and its packaging
• Use the TKGS autoscaler for advanced GPU management

Advanced GPU Management Service
• GPU SDKs / services
• Verify with Kubeflow and 2+ ISVs

VMware-hosted ML images² for the VM Service to support GPUs
• Optimized ML frameworks and libraries for Radium, Nvidia, Intel oneAPI, et al.
• ML models and pipelines
• Physical and virtual device deployer

OS images¹
• Include GPU drivers to support accelerators
• Deliver accelerator add-ons like the GPU Operator

Accelerator enablement
• Vendor-specific accelerator enablement, e.g. Nvidia vGPU
• Hardware topologies from Assignable Hardware
• S-IOV DVX support

IaaS service
• (IAM) Connect ML platforms to Pinniped
• (MinIO) Connect object storage to ML platforms


Kubeflow on vSphere

Prepare ▶ Build ▶ Train & Tune ▶ Deploy & Manage

Kubeflow Dashboard / SDK / …

ML Services
• Prepare: FEAST, Spark, MinIO
• Build: JupyterLab, Fairing, ML Framework Operators (TensorFlow / Paddle / …)
• Train & tune: Katib, Pipelines, Metadata
• Deploy & manage: KFServing, Knative
• Platform services: Istio / Argo / Prometheus / Dex

Kubernetes Operators / Kubernetes Device Plugins

Infrastructure
• TKG, Nvidia AI Enterprise, multi-cloud, local cloud, VMC, public clouds, edge / other K8s
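Katib, listed under Train & tune above, automates hyperparameter search. The idea can be sketched in a few lines of plain Python; this is a stdlib-only illustration of random search, not Katib's actual API, and the objective function is an invented stand-in for a real training run.

```python
import random

def objective(lr: float) -> float:
    # Hypothetical stand-in for a training run: pretend validation
    # accuracy peaks near lr = 0.01 and falls off on either side.
    return 1.0 - abs(lr - 0.01) * 10

def random_search(trials: int = 20, seed: int = 0) -> dict:
    # Katib-style random search: sample hyperparameters from the
    # search space, run a trial for each sample, keep the best result.
    rng = random.Random(seed)
    best = {"lr": None, "score": float("-inf")}
    for _ in range(trials):
        lr = 10 ** rng.uniform(-4, -1)   # log-uniform over [1e-4, 1e-1]
        score = objective(lr)
        if score > best["score"]:
            best = {"lr": lr, "score": score}
    return best

best = random_search()
```

In Katib the trials run as Kubernetes jobs in parallel rather than in a loop, but the propose/evaluate/keep-best structure is the same.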


Kubeflow on vSphere
01–04: Kubeflow on VMware Cloud · ML Images · DLA Operator · APIs · GPUaaS

Data Engineering Team (Data Pipeline): data ingestion, data preprocessing, feature engineering, data curation
ML Engineering Team (ML Pipeline): model development, model training (Operators, Katib, Kubeflow Manifests)
Apps Engineering Team (App Pipeline): model serving

MLOps Platform
• UI / orchestration: dashboard / notebook / pipeline / monitoring (Kubeflow Dashboard)
• Storage: artifacts / metadata / storage (Kubeflow ML Metadata)
• GPU: Bitfusion / Nvidia MIG

Infra (IT Team): vSphere / Kubernetes / Istio / Dex
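The three pipelines above hand artifacts from one team's stage to the next. A toy stdlib-only sketch of that chaining follows; every stage body is a placeholder (the data, the least-squares "training", and the function names are all invented for illustration).

```python
def ingest() -> list:
    # Data pipeline: pull raw records (hypothetical data where y = 2x).
    return [{"x": i, "y": 2 * i} for i in range(10)]

def preprocess(rows: list) -> list:
    # Data pipeline: drop records that fail a basic quality check.
    return [r for r in rows if r["x"] >= 0 and r["y"] >= 0]

def train(rows: list) -> dict:
    # ML pipeline: fit y = w * x by least squares, a stand-in for
    # real model training on a GPU-backed pod.
    num = sum(r["x"] * r["y"] for r in rows)
    den = sum(r["x"] ** 2 for r in rows) or 1
    return {"w": num / den}

def serve(model: dict, x: float) -> float:
    # App pipeline: answer a prediction request with the trained model.
    return model["w"] * x

# The platform's job is to run this chain reliably across three teams.
model = train(preprocess(ingest()))
```

In Kubeflow each function would be a pipeline component with its inputs and outputs tracked in ML Metadata, rather than a direct function call.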


Kubeflow on vSphere – OSS Update
Official Kubeflow distribution

https://www.kubeflow.org/docs/started/installing-kubeflow/
https://vmware.github.io/vSphere-machine-learning-extension/


MLOps Level 1: ML Pipeline Automation (per GCP)

The goal of level 1 is to perform continuous training (CT) of the model by automating the ML pipeline; this lets you achieve continuous delivery of the model prediction service. You need to introduce automated data and model validation steps into the pipeline, as well as pipeline triggers and metadata management.

Characteristics of MLOps level 1: rapid experimentation, CT of the model in production, experimental-operational symmetry, modularized code for components and pipelines, continuous delivery of models, and pipeline deployment.
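The data and model validation steps called out above act as gates in the automated pipeline. A stdlib-only sketch of that control flow (the thresholds, record shape, and function names are invented for illustration):

```python
def validate_data(rows: list, min_rows: int = 5) -> bool:
    # Data validation gate: refuse to train on too little or malformed data.
    return len(rows) >= min_rows and all("x" in r and "y" in r for r in rows)

def validate_model(metric: float, threshold: float = 0.9) -> bool:
    # Model validation gate: only promote models that clear the threshold.
    return metric >= threshold

def ct_pipeline(rows, train_fn, eval_fn) -> dict:
    # One continuous-training run: gate on data, train,
    # gate on model quality, then deploy.
    if not validate_data(rows):
        return {"status": "rejected", "reason": "data validation failed"}
    model = train_fn(rows)
    metric = eval_fn(model, rows)
    if not validate_model(metric):
        return {"status": "rejected", "reason": "model validation failed"}
    return {"status": "deployed", "metric": metric}
```

A pipeline trigger (new data, a schedule, or a code change) would invoke `ct_pipeline`, and metadata management would record each run's verdict and metric.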


Vision: Putting It All Together – Infrastructure Side
Two-layered GPU management and scheduling

Workloads: NFV, VDI, render farm, HPC, AI/ML, plus emerging workloads (GPU ETL, AI/ML, HPC, DB acceleration)

Layer 1 (per workload): Pods, VMs, and bare metal bound to physical/virtual device(s)
Layer 2 (per vSphere cluster): GPU autoscaling, gang scheduling, SW virtualization integration, GPU monitoring, GPU quota, DRS

This management spans multiple vSphere clusters.
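The GPU quota layer above admits or rejects workloads per namespace before they ever reach a scheduler. A minimal stdlib-only sketch of that book-keeping (the class and method names are illustrative, not a vSphere or Kubernetes API):

```python
class GpuQuota:
    # Per-namespace GPU quota tracker: the book-keeping behind
    # a "GPU Quota" layer in a two-layered management scheme.
    def __init__(self, limits: dict):
        self.limits = dict(limits)            # namespace -> max GPUs
        self.used = {ns: 0 for ns in limits}  # namespace -> GPUs in use

    def request(self, namespace: str, gpus: int) -> bool:
        # Admit the workload only if it fits under the namespace quota.
        if self.used.get(namespace, 0) + gpus > self.limits.get(namespace, 0):
            return False
        self.used[namespace] = self.used.get(namespace, 0) + gpus
        return True

    def release(self, namespace: str, gpus: int) -> None:
        # Return GPUs to the pool when the workload finishes.
        self.used[namespace] = max(0, self.used.get(namespace, 0) - gpus)
```

A real implementation would also account for fractional GPUs (vGPU profiles, MIG slices) rather than whole devices.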
Pain Points and Expected UX

Today: complex workflow, manual process, error-prone / low efficiency, ~hours to days
• Step 1 (Data Scientist): the current configuration does not meet the requirement; the DS issues a request to the team to re-configure.
• Step 2 (VI Admin): set up the vmClass and vmClassBinding per the DS's request, and even recreate the MIG instance in the MIG case.
• Step 3 (DevOps): configure TKC resources with the newly created GPU nodes.
• Step 4 (Data Scientist): schedule ML workloads onto the acquired resources.

Expected: simple workflow, fully automated, high efficiency, ~30 mins
• The DS schedules ML workloads directly with the resource requirement; the GPU Scheduling Service provisions the rest, and the DS completes the ML job.
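In the expected UX, a GPU Scheduling Service maps the data scientist's resource requirement to a concrete GPU slice instead of a manual vmClass/MIG reconfiguration. A stdlib-only sketch of that matching step; the profile table mirrors common Nvidia A100 MIG profile names, but the table and function are illustrative, not the product's logic.

```python
# Illustrative MIG profiles: profile name -> GPU memory in GiB.
MIG_PROFILES = {"1g.5gb": 5, "2g.10gb": 10, "3g.20gb": 20, "7g.40gb": 40}

def pick_profile(mem_gib: int):
    # Pick the smallest MIG profile satisfying the requested memory,
    # so the DS states a requirement, never a device configuration.
    fitting = [(mem, name) for name, mem in MIG_PROFILES.items()
               if mem >= mem_gib]
    return min(fitting)[1] if fitting else None
```

Returning `None` corresponds to the request the service cannot satisfy, which today becomes the manual re-configuration loop in steps 1 through 3.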


vSphere Interface for MLOps
Based on the to-be-released Autoscaler (Q2 FY24)

TKGS Cluster: Autoscaler, control and worker VMs, Pods and VMs across namespaces

Supervisor Cluster (IaaS Platform)
• NCP: NSX Container Plugin
• CSI: Container Storage Interface
• TKGS: Tanzu Kubernetes Grid Service
• CAPW: Cluster API Provider for WCP
• VM Operator, VM Service, Net Operator

vSphere SDDC: NSX, CNS, VPXD, WCPSVC, Content Library
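The autoscaler's core decision is how many GPU worker nodes the TKGS cluster needs for the pods it cannot yet place. A stdlib-only sketch of that sizing rule; the names and the simplification of ignoring free GPUs on existing workers are mine, not the product's algorithm.

```python
import math

def desired_workers(pending_gpu_pods: int, gpus_per_pod: int,
                    gpus_per_worker: int, current_workers: int,
                    max_workers: int) -> int:
    # Scale out by enough workers to fit all pending GPU pods, capped
    # at the cluster maximum. (Simplification: assumes no free GPUs
    # remain on the current workers.)
    needed_gpus = pending_gpu_pods * gpus_per_pod
    extra = math.ceil(needed_gpus / gpus_per_worker)
    return min(max_workers, current_workers + extra)
```

A production autoscaler also handles scale-in, bin-packing across heterogeneous vmClasses, and MIG-sliced capacity, which this sketch omits.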


ML Containers, Models and IDE
01 Kubeflow on vSphere · 02 ML Images & IDE · 03 Lifecycle & Pipeline · 04 Device Plugin & Telemetry

• Support higher-level ML and AI services on vSphere: ML and AI services from 3rd parties, pre-trained models
• Support mainstream ML frameworks
• Support both VMs and containers: TKG containers, ESXi VMs
• Support any DLA via passthrough/virtualization: PT/vGPU/MIG/…, Bitfusion, Radium, vAIA

• Get the best performance on vSphere; no tuning required
• Customizable container images
• GitHub open-source project coming soon
• 300+ built-in pre-trained tabular, text and vision models
• 150+ fine-tunable models
• Inference and training SDKs


Thank You
