
One Cluster to Rule Them All
ML on the Cloud
Victor Yap
MLOps Engineer

Rev.com / Rev.ai
Speech Recognition for Customers
Autoscaling

Rightsizing

Spot Instances

Environment Management
Ray

Kubernetes

Karpenter
Ray - “a simple, universal API for building distributed applications”
AWS Batch?

Slurm?
Ray is Python-Friendly
import ray

ray.init()  # starts a local cluster in one line

@ray.remote(num_cpus=1, memory=1024 ** 3)
def preprocess(data):
    pass  # do stuff

@ray.remote(num_gpus=1, accelerator_type="p2")
def train():
    pass  # do stuff
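
How these decorated functions are invoked is worth spelling out (a minimal sketch, not from the slides): each .remote() call submits a task to the cluster and immediately returns a future, and ray.get() blocks until the results are available.

# Hedged sketch assuming the preprocess task defined above.
# .remote() returns an ObjectRef (a future) right away; the work runs on the cluster.
futures = [preprocess.remote(batch) for batch in range(10)]

# ray.get() waits for all tasks to finish and returns their results as a list.
results = ray.get(futures)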
Kubernetes
Karpenter
Demo
import ray

# ray.init() will use the local system to run a cluster
ray.init()

# ray.init accepts an address to connect to a cluster
ray.init("ray://127.0.0.1:10001")
import time
from contextlib import contextmanager

@contextmanager
def timer():
    """Context manager to measure running time of code."""
    start = time.time()
    yield
    time_elapsed = round(time.time() - start)
    print(f"timer: took {time_elapsed} seconds")
def get_instance_type():
    """Returns what instance type this function is running on."""
    import requests
    token = requests.put(
        "http://169.254.169.254/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    ).text
    instance_type = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-type",
        headers={"X-aws-ec2-metadata-token": token},
    ).text
    return instance_type
def print_cluster_resources():
    """Prints the CPUs, memory and GPUs of the current Ray cluster."""
    cluster_resources = ray.cluster_resources()
    CPUs = int(cluster_resources["CPU"])
    memory = round(cluster_resources["memory"] / (1000 ** 3))
    GPUs = round(cluster_resources.get("GPU", 0))
    print(f"CPUs = {CPUs}, memory = {memory}G, GPUs = {GPUs}")
@ray.remote(num_cpus=1, memory=1000 ** 3)
def preprocess(data):
    time.sleep(1)
    return get_instance_type()

with timer():
    print(ray.get(preprocess.remote("data")))

t3.2xlarge
timer: took 2 seconds
print_cluster_resources()

from collections import Counter

with timer():
    print(Counter(
        ray.get([preprocess.remote(x) for x in range(60)])
    ))

CPUs = 4, memory = 9G, GPUs = 0


Counter({'t3.2xlarge': 60})
timer: took 16 seconds
from collections import Counter

with timer():
    print(Counter(
        ray.get([preprocess.remote(x) for x in range(6000)])
    ))

print_cluster_resources()

Counter({'m6a.48xlarge': 4443, 'm6a.32xlarge': 1362, 'c6id.4xlarge': 195})


timer: took 50 seconds
CPUs = 292, memory = 442G, GPUs = 0
@ray.remote(memory=100 * 1000 ** 3)
def preprocess_big_data():
    return get_instance_type()

print(ray.get(preprocess_big_data.remote()))
print_cluster_resources()

i4i.8xlarge
CPUs = 34, memory = 197G, GPUs = 0
@ray.remote(num_gpus=4, accelerator_type="p2")
def train():
return get_instance_type()

with timer():
print(ray.get(train.remote()))
print_cluster_resources()

p2.xlarge
timer: took 178 seconds
CPUs = 37, memory = 239G, GPUs = 1
@ray.remote(num_gpus=4, accelerator_type="p2")
def train():
return get_instance_type()

with timer():
print(ray.get(train.remote()))
print_cluster_resources()

(issues in Karpenter prevented this from actually working)


Should You Try This?

Or

Choose Proven Technologies?


Demo Code:

github.com/vicyap/mlops-world-2022
