
Cheaper and Faster: Distributed Deep Reinforcement Learning with Serverless Computing


Hanfei Yu1, Jian Li2, Yang Hua3, Xu Yuan4, Hao Wang1
1 Louisiana State University, 2 Stony Brook University, 3 Queen's University Belfast, 4 University of Delaware
{hyu25, haowang}@lsu.edu, [email protected], [email protected], [email protected]

Abstract

Deep reinforcement learning (DRL) has gained immense success in many applications, including gaming AI, robotics, and system scheduling. Distributed algorithms and architectures have been vastly proposed (e.g., actor-learner architecture) to accelerate DRL training with large-scale server-based clusters. However, training on-policy algorithms with the actor-learner architecture unavoidably induces resource wasting due to synchronization between learners and actors, thus resulting in significantly extra billing. As a promising alternative, serverless computing naturally fits on-policy synchronization and alleviates resource wasting in distributed DRL training with pay-as-you-go pricing. Yet, none has leveraged serverless computing to facilitate DRL training. This paper proposes MinionsRL, the first serverless distributed DRL training framework that aims to accelerate DRL training- and cost-efficiency with dynamic actor scaling. We prototype MinionsRL on top of Microsoft Azure Container Instances and evaluate it with popular DRL tasks from OpenAI Gym. Extensive experiments show that MinionsRL reduces total training time by up to 52% and training cost by 86% compared to latest solutions.

Figure 1: Server-based v.s. serverless architectures. (a) One-round on-policy training; (b) Server-based training, where actors are idle but still incur costs during synchronization and learner updates; (c) Serverless training, where idle actor and learner functions are released.
Introduction

The success of AlphaGo (Silver et al. 2016) inspires various deep reinforcement learning (DRL) applications, such as gaming AI (Vinyals et al. 2019; Berner et al. 2019), robotics (Ji et al. 2022; Thumm and Althoff 2022), system scheduling (Mao et al. 2022; Qiu et al. 2023), bioinformatics (Jumper et al. 2021), and large language model training (OpenAI 2023). DRL training is expensive: it takes numerous trials and errors, consuming countless computing resources and time. Thus, a few distributed DRL algorithms are proposed to parallelize and accelerate the training with multiple servers (Luo et al. 2020; Wijmans et al. 2019; Espeholt et al. 2018; Horgan et al. 2018; Kapturowski et al. 2018; Hessel et al. 2018; Espeholt et al. 2020).

The actor-learner architecture represents one of the most efficient distributed DRL training paradigms available (Luo et al. 2020; Espeholt et al. 2018, 2020). This approach decouples the DRL agent's responsibilities into two distinct roles: actors for data sampling and learners for policy updates. On-policy algorithms (Schulman et al. 2017; Achiam et al. 2017; Wijmans et al. 2019) have emerged as a prominent DRL algorithm family, fully leveraging the actor-learner training architecture with distributed computing clusters (Gu et al. 2017).

To facilitate efficient learning in a distributed environment with consistent DRL policies, on-policy algorithms enforce a synchronization process between learners and actors after every training round, as Fig. 1(a) shows. Due to the stochastic environment dynamics (e.g., game environments), some actors might have episodes that end sooner, leading them to finish rounds earlier and wait idle for other actors. Additionally, all actors remain idle during the policy update by the learner, as Fig. 1(b) shows. However, these idle actors significantly waste computing resources, amplifying training costs with server-based clusters.

The dilemma of server-based DRL training. State-of-the-art server-based approaches reserve a fixed number of workers (e.g., physical or cloud servers) for distributed DRL training. These methods face two primary challenges: 1) their coarse-grained resource management (e.g., server-level instead of CPU core-level) leaves idle actors' resources unreleased; and 2) the prolonged server startup process (minute-level) prevents efficient mitigation of DRL actors' idle time through frequent server toggling. Thus, we propose
to enable cheaper and faster distributed DRL with serverless computing.

Serverless computing and how it fits distributed DRL? Serverless computing, also known as Function-as-a-Service (FaaS), is a new cloud computing model that uses lightweight containers as execution units. Unlike physical clusters and traditional cloud computing that require tedious configuration, serverless computing packages and executes tasks (e.g., DRL actors and learner) as functions with instant toggling (i.e., sub-second level) and auto-scaling. Thus, serverless computing has been widely deployed to serve computation-intensive applications, such as deep learning (Ali et al. 2020; Carreira et al. 2019; Wang, Niu, and Li 2019; Yu et al. 2021, 2022) and scientific computing (Chard et al. 2020; Roy et al. 2022). Fig. 1(c) shows how serverless functions naturally accommodate the on-policy training process with on-and-off DRL actors and learners, which mitigates idle resources.

Leveraging serverless computing's fine-grained resource provisioning and instant execution, a fundamental question arises—how to achieve faster and cheaper DRL training with an appropriate number of concurrent actors in each round?

To answer this question, we propose MinionsRL, the first serverless DRL training framework, which dynamically adjusts the number of actors according to the DRL training progress. As the training proceeds, it takes varying volumes of training data to advance neural network model quality in each round (Devarakonda, Naumov, and Garland 2017; McCandlish et al. 2018). In the actor-learner architecture, the number of actors in each round determines the volume of sampled training data, thus impacting the policy network quality. This intuition leads us to design an intelligent scheduler that learns to perform dynamic actor scaling for each training round to optimize the DRL policy quality with minimal training time and costs. Our main contributions are as follows:
• We propose MinionsRL, the first distributed DRL training framework based on serverless computing.
• We design an intelligent scheduler that learns to scale out actors dynamically and accelerate distributed DRL training with minimal costs.
• We evaluated MinionsRL on an off-the-shelf serverless testbed (i.e., Microsoft Azure). Experiments with OpenAI Gym show that MinionsRL reduces up to 52% total training time and 86% costs, respectively.

Preliminaries

Actor-learner architecture
The actor-learner architecture is one of the most performant and efficient approaches that attempt to scale and accelerate DRL training. A3C (Mnih et al. 2016) first introduced a simple actor-learner prototype. IMPALA (Espeholt et al. 2018) proposed a standard actor-learner architecture with V-trace correction for off-policy training. IMPACT (Luo et al. 2020) added a surrogate target network to the actor-learner architecture for stabilizing training performance. SEED RL (Espeholt et al. 2020) aimed to accelerate the actor-learner architecture by centralizing actor inferences to GPUs.

Server-based v.s. Serverless DRL Training
Server-based training platforms provide users with an entire server with coarse-grained resources packed together. For example, the cheapest Azure cloud server equipped with a V100 GPU is Standard NC6s v3, bundled with 6 CPU cores and 112GB memory. Instead, serverless computing executes tasks with lightweight containers, thus allowing fine-grained resource provisioning with instant function launch/release, and charges users only for the amount of resources (e.g., CPU/GPU and memory) consumed during actual execution (e.g., per second). Due to these unique features, serverless computing is particularly appealing for tasks that require elasticity and high concurrency, such as scientific computing (Chard et al. 2020; Roy et al. 2022) and distributed training (Wang, Niu, and Li 2019; Guo et al. 2022; Thorpe et al. 2021; Yu et al. 2021, 2022).

Motivating Dynamic Actor Scaling for DRL
One of the fundamental differences between DRL and supervised learning is the training data. In supervised learning tasks, training data is collected offline before the training starts, whereas DRL tasks sample the training data online during the rollout of the current policy with actors. As the training proceeds, neural networks tend to demand varying volumes of training data in each round (Devarakonda, Naumov, and Garland 2017; McCandlish et al. 2018). Hence, the number of DRL actors dictates the amount of training data sampled in each round, potentially influencing the efficiency of DRL training and the quality of the policy.

Figure 2: Adjusting the number of actors when training OpenAI Gym CartPole-v1 (Brockman et al. 2016) with Proximal Policy Optimization (PPO) (OpenAI 2017). (a) Actor scheduling; (b) Final rewards.

Fig. 2 uses a real-world experiment to show the potential impact on policy quality when adjusting the number of actors during DRL training. Fig. 2(a) shows the three actor dynamic scaling strategies: 1) Fixed, which uses a fixed number of actors, 2) Decrease, which decreases ten actors every ten training rounds, and 3) Increase, which increases ten actors every ten rounds. Note that the three strategies are under the same actor budget (i.e., the cumulative number of total used actors is the same). Fig. 2(b) shows the different final rewards achieved by the three actor scaling strategies, raising a fundamental question—given the flexibility and scalability of serverless computing, how to dynamically scale out actors for faster and cheaper DRL training?
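To make the three strategies concrete, the minimal sketch below constructs such schedules under an equal cumulative actor budget. The starting counts and the 40-round horizon are illustrative assumptions; the paper fixes only the plus/minus ten actors every ten rounds pattern.

```python
# Sketch of the three actor-scaling strategies from the motivating experiment (Fig. 2):
# Fixed, Decrease (-10 actors every 10 rounds), and Increase (+10 every 10 rounds),
# constructed so that all three use the same cumulative actor budget.

def schedules(num_rounds: int = 40):
    decrease = [40 - 10 * (r // 10) for r in range(num_rounds)]   # 40, 30, 20, 10 actors
    increase = [10 + 10 * (r // 10) for r in range(num_rounds)]   # 10, 20, 30, 40 actors
    budget = sum(decrease)                                        # shared cumulative budget
    fixed = [budget // num_rounds] * num_rounds                   # constant 25 actors/round
    assert sum(fixed) == sum(increase) == budget
    return fixed, decrease, increase

fixed, decrease, increase = schedules()
```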
MinionsRL's Design

Figure 3: MinionsRL's architecture.

Overview
To answer this question, we propose MinionsRL, which refactors the actor-learner DRL architecture into independent serverless functions with fine-grained resource management. MinionsRL aims to instantly launch the necessary number of actors for faster DRL training, while promptly releasing idle actor and learner functions to optimize cost-efficiency. Serverless computing's pay-as-you-run nature frees MinionsRL from expenses on stopped functions, thus reducing unnecessary monetary costs throughout the training process. Serverless computing also provides agile scalability so that MinionsRL can dynamically scale the number of actors in real time as needed.
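As a rough illustration of this refactoring, the sketch below runs one on-policy round with a per-round actor count: short-lived actor functions sample in parallel, a single learner function updates the policy, and nothing stays allocated between rounds. Thread workers stand in for serverless containers purely for illustration, and the function bodies are placeholders rather than MinionsRL's implementation.

```python
# One on-policy round in the serverless layout: launch exactly num_actors actor
# "functions", gather their trajectories, then run one learner "function".
from concurrent.futures import ThreadPoolExecutor

def actor_fn(weights, env_seed):
    """Roll out the current policy and return a trajectory (placeholder logic)."""
    return {"seed": env_seed, "trajectory": [], "final_reward": 0.0}

def learner_fn(weights, trajectories):
    """Run one policy update on the sampled data (placeholder logic)."""
    return weights  # updated weights

def run_round(weights, num_actors):
    # Actor workers are released as soon as they return, so no actor sits idle
    # while the learner updates the policy.
    with ThreadPoolExecutor(max_workers=num_actors) as pool:
        trajectories = list(pool.map(lambda s: actor_fn(weights, s), range(num_actors)))
    return learner_fn(weights, trajectories)

weights = {}
for num_actors in [32, 16, 8]:   # per-round actor counts chosen by a scheduler
    weights = run_round(weights, num_actors)
```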
Specifically, MinionsRL aims to address two primary challenges:

Incorporate characteristics of DRL tasks. DRL training significantly differs from other ML training, for example, in its recurrent interaction and online data sampling. To achieve high performance and low cost, it is necessary to make MinionsRL's scheduling aware of DRL workload characteristics. However, existing machine learning schedulers are not designed for distributed DRL training (Guo et al. 2022; Wang, Niu, and Li 2019; Carreira et al. 2019), thus their tricks are not directly applicable.
Solution: The training process of MinionsRL is designed to be DRL objective- and constraint-aware. To capture unique characteristics of DRL workloads, we embed critical features into the states of MinionsRL's agent, such as the average final rewards and the Kullback–Leibler (KL) divergence. The reward function of MinionsRL's agent is also crafted with awareness of the momentary budget and workload actor performance, guiding MinionsRL to search for optimal scheduling decisions through training.

Trade-off between training performance and cost. It is ambiguous to determine how many actors should be launched in each round to hit a sweet spot between training performance and cost. Moreover, it is difficult to infer the complicated dependency between actor scheduling and policy updates, which further escalates the challenge.
Solution: We formulate the actor scheduling of distributed DRL training as a sequential decision problem and analyze its complexity. We devise a DRL-based scheduler to dynamically scale actors by learning from experiences.

Problem Formulation
We consider a general RL training setting—an agent continuously interacts with the environment to learn a policy that maximizes cumulative rewards. The training proceeds in the actor-learner fashion as shown in Fig. 3. The training is terminated when the agent achieves the target final reward J or runs out of a monetary budget B. Let f_{k,i} be the actor function i scheduled for sampling trajectories in round k, where i ∈ {1, ..., I_k} and k ∈ {1, ..., K}; I_k and K denote the total number of actor functions in round k and the total number of training rounds to reach the final reward J, respectively. When all actor functions are terminated after sampling, one learner function is launched to learn and update the policy based on the sampled data. At the end of round k, the cumulative reward achieved by the agent is represented as j_k. Let P_{k,i}^a and P_k^l denote the execution time of the i-th actor and the learner function in round k, where each actor and learner function is allocated d^a and d^l resources, respectively. We use c to represent the unit price of executing a function with a unit resource for one second. Thus, the duration P_k and cost C_k of round k in on-policy training are given by

$$P_k := P_k^l + \max_i \{P_{k,i}^a\}, \qquad (1)$$

$$C_k := c \Big( P_k^l d^l + \sum_{i=1}^{I_k} P_{k,i}^a d^a \Big). \qquad (2)$$

The goal is to minimize the training duration $\sum_{k=1}^{K} P_k$ via Eq. 1 while the cost $\sum_{k=1}^{K} C_k$ via Eq. 2 is subject to a monetary budget B, by deciding I_k in each round:

$$\min_{I_k} \sum_{k=1}^{K} \Big( P_k^l + \max_i \{P_{k,i}^a\} \Big), \qquad (3)$$

$$\text{s.t.} \quad K \ge 1, \quad j_k \ge J, \quad \sum_{k=1}^{K} c \Big( P_k^l d^l + \sum_{i=1}^{I_k} P_{k,i}^a d^a \Big) \le B. \qquad (4)$$

The optimization problem is a challenging sequential decision problem with an exponential complexity of O(I^K) for searching optima. Exhaustively enumerating the optimal solution is unrealistic due to the need for countless retraining. What's more, the complex correlation between actor scheduling and policy updates further escalates the difficulty of solving the problem. Therefore, we resort to DRL itself—using a DRL agent to learn how to optimally schedule actors for distributed DRL training workloads.
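For reference, the following small sketch implements the cost model behind Eqs. 1-2 and the cumulative budget check in Eq. 4. The per-function execution times, resource sizes d^a and d^l, unit price c, and budget B below are illustrative numbers, not values from the paper.

```python
# Round duration and cost model (Eqs. 1-2) plus the budget constraint (Eq. 4).

def round_duration(learner_time: float, actor_times: list[float]) -> float:
    # Eq. 1: P_k = P_k^l + max_i P_{k,i}^a (actors sample in parallel, then the learner runs)
    return learner_time + max(actor_times)

def round_cost(learner_time: float, actor_times: list[float],
               d_l: float, d_a: float, c: float) -> float:
    # Eq. 2: C_k = c * (P_k^l * d^l + sum_i P_{k,i}^a * d^a)
    return c * (learner_time * d_l + sum(t * d_a for t in actor_times))

budget, spent = 10.0, 0.0                             # monetary budget B
for actor_times in [[5.2, 4.8, 6.1], [4.9, 5.5]]:     # two rounds, different actor counts I_k
    spent += round_cost(2.0, actor_times, d_l=6.0, d_a=1.0, c=0.0001)
    p_k = round_duration(2.0, actor_times)
    assert spent <= budget                            # Eq. 4: cumulative cost stays within B
```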
DRL-based Actor Scheduler
Fig. 3 depicts the architecture of MinionsRL with a two-fold workflow: 1) the DRL workload that trains in the actor-learner fashion, and 2) the DRL-based actor scheduler that manages the DRL workload training. At the beginning of each round k, the scheduler takes an action on deciding how many actors I_k should be launched for evaluation and data sampling, based on the state collected from the learner update. The action made by the scheduler is judged by a per-round reward from the actors. We describe the design of states, actions, and rewards in our actor scheduler as follows:

State. The state is represented by a flat vector s_k = (k, L_{k-1}, R̄_{k-1}, D_k^{KL}, P_{k-1}^l, P̄_{k-1}^a, P_{k-1}, b_k). Specifically, L_{k-1} and R̄_{k-1} are the loss value of the learner and the average final rewards of actors evaluated from the previous round. D_k^{KL} := Σ_a π_k(a|s) log [π_k(a|s) / π_{k-1}(a|s)] denotes the KL divergence of two consecutive workload policies π_k^w and π_{k-1}^w, which is commonly employed to measure the difference between two policies (Achiam et al. 2017; Schulman et al. 2017, 2015). We include L_{k-1}, R̄_{k-1}, and D_k^{KL} in the state to provide the scheduler insights about how the learner policy updates. Recall that P_{k-1}^l and P_{k-1} represent the execution time of the learner and the total duration of training round k-1, respectively. Additionally, P̄_{k-1}^a represents the execution time averaged over actors from the previous round, and b_k represents the budget remaining after training of the current round. The scheduler leverages the above metrics to adjust its decisions during the scheduling process.

Action. At the beginning of round k, the scheduler outputs an action a_k := I_k, a scalar value selected within [1, I_max] ∈ Z+, where I_max is the maximum number of actors that we can allocate per round. The scheduler chooses action a_k under the guidance of its policy π^h(θ).

Reward. The reward returned at the end of round k is defined as r_k := -βP_k, where β ∈ (0, 1) is a reward coefficient. The cumulative reward through K rounds is given by -Σ_{k=1}^{K} γ^k βP_k, where γ ∈ (0, 1). Intuitively, the longer the workload takes to finish training (either actor evaluation reaching the target final reward J or running out of budget B), the more we penalize the scheduler. Additionally, we define the reward of the end round K as

$$r_K := \begin{cases} -\beta P_K & \text{if } \bar{R}_K \ge J \text{ and } b_K \ge 0, \\ -\beta P_K + (\max_k \bar{R}_k - J) & \text{otherwise.} \end{cases}$$

We add an additional term to the reward at round K to judge the overall performance of MinionsRL's scheduler. If the scheduler fails, i.e., actor evaluation always fails to reach the target final reward J and runs out of budget B, the term (max_k R̄_k - J < 0) penalizes the scheduler with negative returns. Further, lower actor evaluation performance gets more penalties. Thus, we guide MinionsRL's scheduler to overcome failures by minimizing the gap (max_k R̄_k - J) while aiming to speed up the workload training.

Note that both training time and cost are considered in the problem formulation, where training time is our direct optimization objective (Eq. 3) and training cost is a hard constraint for MinionsRL (Eq. 4). MinionsRL supports minimizing cost by slightly changing the reward function. We assume the common practice of optimizing training performance under a given monetary budget.
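Putting the above together, here is a minimal sketch of how the scheduler's state vector and rewards could be assembled. The feature names mirror the definitions above, while the numeric values and the β = 0.1 coefficient are illustrative assumptions.

```python
# Assemble the scheduler state s_k and compute per-round and terminal rewards.
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    # D_k^KL = sum_a pi_k(a|s) * log(pi_k(a|s) / pi_{k-1}(a|s)) for one sampled state s
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def build_state(k, learner_loss, avg_final_reward, kl, learner_time,
                avg_actor_time, round_duration, remaining_budget):
    # s_k = (k, L_{k-1}, Rbar_{k-1}, D_k^KL, P^l_{k-1}, Pbar^a_{k-1}, P_{k-1}, b_k)
    return [k, learner_loss, avg_final_reward, kl, learner_time,
            avg_actor_time, round_duration, remaining_budget]

def round_reward(round_duration: float, beta: float = 0.1) -> float:
    return -beta * round_duration                     # r_k = -beta * P_k

def terminal_reward(round_duration, best_reward, target_reward,
                    beta=0.1, succeeded=True):
    # r_K adds (max_k Rbar_k - J) as a penalty when the target J was never reached.
    r = -beta * round_duration
    return r if succeeded else r + (best_reward - target_reward)

state = build_state(12, 0.42, 310.0, kl_divergence([0.6, 0.4], [0.5, 0.5]),
                    2.0, 5.5, 7.5, 3.2)
```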
Training MinionsRL's Scheduler

Table 1: Hyperparameters of PPO used in the training workloads and the search ranges of the scheduler.

Parameter | Workload | Scheduler
Learning rate | 0.00005 | [0.001, 0.005, 0.01]
Discount factor (γ) | 0.99 | 0.99
Mini-batch size | 256 | [1, 2, 4]
Clip parameter | 0.3 | [0.1, 0.2, 0.3]
KL coefficient | 0.2 | 0.0
KL target | 0.01 | [0.005, 0.01, 0.015]
Entropy coefficient | 0.0 | [0.005, 0.01, 0.015]
Value function coefficient | 1.0 | [0.1, 0.3, 0.5, 0.7]

We employ the popular PPO algorithm (Schulman et al. 2017) to train MinionsRL's scheduler. Table 1 characterizes the hyperparameters and search ranges of PPO used in MinionsRL. We employed Ray Tune (Liaw et al. 2018) to efficiently search for optimal hyperparameters within the ranges. The lightweight policy and critic networks in MinionsRL's scheduler are constructed by two fully-connected layers of 64 hidden units with Tanh activation. We follow existing DRL-driven scheduling works (Mao et al. 2019; Qiu et al. 2020, 2022, 2023) in using Tanh for simple neural architectures; MinionsRL also supports other activation units. We update the parameters of the scheduler policy using the Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.005. MinionsRL is trained with 100 episodes per task.
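For concreteness, a minimal PyTorch sketch of the scheduler networks described above (two fully-connected layers of 64 hidden units with Tanh, trained with Adam at learning rate 0.005). The eight-dimensional state and the discrete action head over [1, I_max] follow the State and Action definitions; the PPO loss computation itself is omitted.

```python
# Scheduler policy and critic networks: 2 x 64 hidden units, Tanh activation.
import torch
import torch.nn as nn

STATE_DIM, I_MAX = 8, 32

policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, I_MAX),            # logits over actions I_k in {1, ..., I_max}
)
critic_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),                # state-value estimate
)
optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(critic_net.parameters()), lr=0.005)

state = torch.zeros(1, STATE_DIM)    # placeholder state s_k
action = torch.distributions.Categorical(logits=policy_net(state)).sample() + 1
```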
Table 2: Total training time and costs for six tasks.

Environment | Baseline | Time (s) | Cost ($)
Hopper | MinionsRL | 241 ± 24 | 1.2 ± 0.2
Hopper | Azure ML | 277 ± 21 | 4.5 ± 0.5
Hopper | IMPACT | 291 ± 26 | 4.6 ± 0.8
Hopper | MinionsRL-Adapt | 403 ± 45 | 4.4 ± 0.6
Hopper | MinionsRL-Max | 232 ± 19 | 1.7 ± 0.3
Humanoid | MinionsRL | 334 ± 42 | 1.3 ± 0.4
Humanoid | Azure ML | 464 ± 56 | 9.3 ± 1.2
Humanoid | IMPACT | 436 ± 57 | 2.9 ± 0.7
HalfCheetah | MinionsRL | 220 ± 29 | 1.1 ± 0.3
HalfCheetah | Azure ML | 458 ± 49 | 2.9 ± 0.6
HalfCheetah | IMPACT | 193 ± 12 | 3.0 ± 0.8
Gravitar | MinionsRL | 2295 ± 337 | 6.6 ± 0.9
Gravitar | Azure ML | 2902 ± 481 | 11.4 ± 1.5
Gravitar | IMPACT | 3375 ± 714 | 12.0 ± 1.7
SpaceInvaders | MinionsRL | 1787 ± 229 | 7.8 ± 1.2
SpaceInvaders | Azure ML | 2260 ± 343 | 26.9 ± 3.1
SpaceInvaders | IMPACT | 2628 ± 402 | 25.8 ± 2.4
Qbert | MinionsRL | 506 ± 59 | 2.3 ± 0.8
Qbert | Azure ML | 872 ± 68 | 6.0 ± 1.0
Qbert | IMPACT | 768 ± 66 | 6.4 ± 1.3
Qbert | MinionsRL-Adapt | 750 ± 61 | 2.0 ± 0.7
Qbert | MinionsRL-Max | 484 ± 33 | 5.1 ± 1.2

Evaluation
We prototype and evaluate MinionsRL on top of ACI (Azure Container Instances 2022) and the Ray library (Moritz et al. 2018).
Figure 4: MinionsRL outperforms baselines on statistical and time efficiency for continuous and discrete control tasks. Panels: (a) Hopper-v3; (b) Humanoid-v3; (c) HalfCheetah-v3; (d) GravitarNoFrameskip-v4; (e) SpaceInvadersNoFrameskip-v4; (f) QbertNoFrameskip-v4. Each panel plots final rewards versus the number of rounds and versus wall clock time (s) for MinionsRL, Azure ML, and IMPACT.
Experimental Setup
Testbeds. We deploy all server-based baselines to a cluster of Azure VMs: one Standard NC6s v3 virtual machine (VM) and four Standard E16-8s v5 VMs. The cluster contains one NVIDIA V100 GPU and four 8-core Intel Xeon Platinum CPUs (32 cores in total) for training DRL workloads. MinionsRL is prototyped on Azure Container Instances (ACI) (Azure Container Instances 2022). When training DRL workloads with MinionsRL, according to our workload profiling, each learner container is configured with one V100 GPU and each actor container with one CPU core, respectively. We limit the actor allocation range of MinionsRL within [1, 32] during every training round.
Workloads. Six environments from OpenAI Gym are used to evaluate MinionsRL and the other baselines, including three continuous-action MuJoCo environments (Hopper-v3, Humanoid-v3, and HalfCheetah-v3) and three discrete-action Atari environments (SpaceInvadersNoFrameskip-v4, QbertNoFrameskip-v4, and GravitarNoFrameskip-v4). For MuJoCo, the policy network consists of two fully-connected layers of 256 hidden units with Tanh activation. For Atari, the policy network consists of three convolutional layers of 8×8, 4×4, and 11×11 kernel sizes with ReLU activation, respectively. The input sampled from Atari games is a stack of three 84×84 images. In both cases, the critic networks share the same architecture as the policy networks. Due to superior performance and popularity (OpenAI 2017), we use PPO as the learner policy optimizer (shown in Fig. 3) for the above workloads in the evaluation. Table 1 describes the hyperparameter settings of PPO used in the training workloads. We used the default hyperparameters from Ray RLlib (Liang et al. 2018) for the MuJoCo and Atari tasks. Comparisons against the baselines are fair as long as all systems train on the same tasks. While we evaluate on these six tasks, our solution is broadly applicable to DRL workloads with any reinforcement learning (RL) training algorithms and environments.
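For concreteness, a sketch of the two workload policy-network shapes described above, assuming PyTorch. The convolution channel counts, strides, padding, and the continuous-action head are assumptions made only so the shapes line up; the text specifies just the layer counts, kernel sizes, activations, and the 3×84×84 stacked input.

```python
# Policy-network shapes: 2x256 Tanh MLP for MuJoCo, three conv layers for Atari.
import torch
import torch.nn as nn

class MuJoCoPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
        )
        self.mean = nn.Linear(256, act_dim)              # Gaussian mean (continuous actions)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        h = self.torso(obs)
        return self.mean(h), self.log_std.exp()

class AtariPolicy(nn.Module):
    def __init__(self, num_actions: int):
        super().__init__()
        # Input: a stack of three 84x84 frames -> (N, 3, 84, 84).
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),              # 84 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),             # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=11, stride=1, padding=5), nn.ReLU(), # 9 -> 9
            nn.Flatten(),
        )
        self.logits = nn.Linear(64 * 9 * 9, num_actions)

    def forward(self, frames):
        return self.logits(self.conv(frames))

# The critic networks reuse the same torso shapes with a scalar value head.
```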
Comparisons with Baselines
We compare MinionsRL with two server-based baselines. 1) Azure ML (Azure Machine Learning 2022) is a state-of-the-practice, ML-as-a-Service platform that provides rapid model deployment and training. Despite waiving the deployment and startup costs of DRL workloads, users are still charged for resource idle time during workload training, as demonstrated in Fig. 1. We implement the distributed PPO training method using the testbed cluster on Azure ML. 2) IMPACT (Luo et al. 2020) is a state-of-the-art actor-learner training architecture. It builds on a long list of improvements over PPO and combines various tricks for asynchronous training, such as V-trace importance sampling (Espeholt et al. 2018) and the surrogate target network (Lillicrap et al. 2015). We consider IMPACT to investigate how MinionsRL compares with off-policy architectures.
Final rewards. Fig. 4 shows the final rewards averaged over five repeated experiments, each with a different random seed, for three continuous and three discrete control tasks, respectively. MinionsRL and the baselines are stopped once they reach the same target final reward or run out of
the same budget. The performance variation is subtle at the beginning and gradually increases as training proceeds. The variation drops at the final parts because some of the five experiments have ended earlier (either reaching the desired rewards or running out of budget). Thus, only one or two experiments proceed to further rounds/timestamps, leaving less variation—zero variation at the end if only one experiment remains. The results show that MinionsRL is more efficient in transforming the monetary budget into training time. Under the same budget, MinionsRL trains much faster than Azure ML and IMPACT in statistical efficiency and wall clock time with similar or better performance.
Training cost. Table 2 reports the total training time and costs when the baselines reached the same final rewards. Compared to Azure ML and IMPACT, MinionsRL reduces training time and costs by up to 52% and 86%, respectively.

Actor Scheduling

Figure 5: MinionsRL's actor scheduling decisions on two tasks. MinionsRL dynamically schedules actors to balance training performance and cost. (a) Hopper-v3; (b) QbertNoFrameskip-v4.

We record and report how MinionsRL makes actor scheduling decisions to investigate the rationale behind the performance gain compared with the baselines. Fig. 5 depicts the number of actors MinionsRL schedules and the final rewards per round on Hopper-v3 and QbertNoFrameskip-v4, respectively. We use A, B, and C for convenience when referring to the three phases of decisions made by MinionsRL in Fig. 5. For Hopper-v3, MinionsRL launches more actors at the beginning of Phase A to boost training and gradually decreases the number of actors to save cost when performance steadies in Phases B and C. More actors are launched by MinionsRL at the end of Phase C to explore optimal performance. We observe similar results on QbertNoFrameskip-v4, where MinionsRL boosts training with more actors in Phases A and B, and reduces actors in the steady Phase C to save cost.

In contrast to the two baselines (i.e., Azure ML and IMPACT) that launch a fixed number of actors for every round, MinionsRL dynamically schedules actors throughout the training process to strike a balance between training performance and cost, thus completing training tasks cheaper and faster.

Ablation Study

Figure 6: Ablation study of MinionsRL with its two variants: MinionsRL-Adapt and MinionsRL-Max. (a) Hopper-v3; (b) QbertNoFrameskip-v4.

To verify the effectiveness of two key components, serverless functions and the DRL-based scheduler, we compare MinionsRL with two variants of itself: 1) MinionsRL-Max statically launches all 32 actors in every training round, and 2) MinionsRL-Adapt schedules actors with a naive, reward ratio-based scheduler. Let J be the target final reward and I_max be the maximum number of available actors per round. Let Ĵ_k denote the approximated final reward that the learner policy can achieve at round k, which is computed using a moving window averaged over the last n rounds, given by Ĵ_k := Σ_{x=k-n-1}^{k-1} J_x. MinionsRL-Adapt schedules a set of actor functions I_k proportional to the ratio of Ĵ_k and J, which is given by I_k := clip(1, (Ĵ_k / J) · I_max, I_max). This naive scheduler follows the intuition that a better policy may produce better data, so we proportionally allocate more actors when the policy quality is higher. We set the moving window size n = 5 in the evaluation.
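A short sketch of MinionsRL-Adapt's ratio-based rule as described above. The window estimate is taken here as a plain average of the last n = 5 final rewards (the text says "averaged"), and the cold-start behavior when no history exists yet is an assumption.

```python
# Naive ratio-based scheduler (MinionsRL-Adapt): I_k = clip(1, (J_hat/J) * I_max, I_max).
from collections import deque

def adapt_scheduler(reward_history: deque, target_reward: float, i_max: int) -> int:
    """Return the number of actors I_k for the next round."""
    if not reward_history:
        return i_max                                        # no estimate yet: use the cap
    j_hat = sum(reward_history) / len(reward_history)       # moving-window estimate of J_k
    proposed = int(j_hat / target_reward * i_max)
    return max(1, min(proposed, i_max))                     # clip to [1, I_max]

# Usage: keep a window of the last n = 5 rounds' final rewards.
window = deque(maxlen=5)
for round_reward in [120.0, 180.0, 250.0]:
    window.append(round_reward)
num_actors = adapt_scheduler(window, target_reward=500.0, i_max=32)
```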
Final rewards. Fig. 6 shows the final rewards averaged over five repeated experiments for Hopper-v3 and QbertNoFrameskip-v4, respectively. By comparing MinionsRL with MinionsRL-Max, we observe that MinionsRL's DRL-based scheduler preserves similar or better training efficiency while saving actor costs. Note that MinionsRL-Max also runs the same DRL tasks with serverless functions. When comparing MinionsRL with MinionsRL-Adapt, the results demonstrate that MinionsRL's DRL-guided scheduler makes better decisions on actor scheduling than the naive ratio-based scheduler.
Training cost. Table 2 shows the total training time and costs of MinionsRL and the two variants when reaching the same final rewards. Compared to MinionsRL-Max, MinionsRL significantly reduces training cost by up to 44% while completing training with a similar duration.
Scalability

Figure 7: Scalability of MinionsRL with respect to the number of actors in six environments.

Fig. 7 illustrates MinionsRL's scalability using the same testbed. The total completion time of one-round DRL training increases as the number of actors increases. Training time for the Atari environments (i.e., GravitarNoFrameskip-v4, SpaceInvadersNoFrameskip-v4, and QbertNoFrameskip-v4) has a larger increase rate than the MuJoCo environments (i.e., Hopper-v3, Humanoid-v3, and HalfCheetah-v3), because processing stacked frames brings significantly more computation load to the learner.

Breakdown

Figure 8: Latency breakdown of interaction between actor and learner function in MinionsRL's one-round training. (a) Continuous Environments; (b) Discrete Environments. Bars break down startup, execution, and communication time.

Latency breakdown. Fig. 8(a) and (b) characterize the latency breakdown of the interaction between actor and learner functions in MinionsRL's one-round training. Launching an actor function and a learner function takes around 300 and 1500 ms (attaching GPUs to the learner container takes more time), respectively. We further eliminate the startup overhead by function pre-warming.

Communication overheads. MinionsRL uses the efficient gRPC library to enable lightweight communication between actor and learner functions. Fig. 8(a) and (b) show the communication overhead between actor and learner functions. For the (continuous) MuJoCo environments, transferring 65,536 timesteps between the actor and learner functions incurs less than 100 ms of communication overhead. For the (discrete) Atari environments, the overhead is less than 800 ms for 6,144 stacked frames. The communication overheads are trivial compared to the end-to-end training time per round.

Scheduler Training Overhead Mitigation

Figure 9: Training the scheduler from scratch v.s. fine-tuning the scheduler trained from a different task. (a) Humanoid-v3; (b) SpaceInvadersNoFrameskip-v4.

MinionsRL trains the scheduler for each DRL task, which may lead to high overheads. For example, training a scheduler for Humanoid-v3/SpaceInvadersNoFrameskip-v4 from scratch took around 10/50 hours. We further investigate mitigating such overheads by fine-tuning a trained scheduler of one task to other tasks. Fine-tuning MinionsRL is feasible since different DRL tasks have the same observation and action shapes as input and output sizes to MinionsRL's scheduler networks. Fig. 9 presents the performance of training MinionsRL from scratch versus fine-tuning from another task. We fine-tune the trained schedulers of Hopper-v3 and QbertNoFrameskip-v4 to Humanoid-v3 and SpaceInvadersNoFrameskip-v4, respectively. Fine-tuning each scheduler took ten episodes while achieving similar or better performance than training from scratch. More importantly, fine-tuning drastically reduces scheduler training time and cost. It only took around one/four hours to fine-tune a scheduler for Humanoid-v3/SpaceInvadersNoFrameskip-v4, reducing the scheduler training time and cost by 90%.
0 50 0 1000 2000 training time and cost by 90%.
# of round Wall clock time (s)
(b) SpaceInvadersNoFrameskip-v4
Conclusion
Figure 9: Training the scheduler from scratch v.s. fine-tuning We proposed M INIONS RL, the first distributed DRL train-
the scheduler trained from a different task. ing framework based on serverless computing. By leverag-
ing serverless computing, M INIONS RL enables agile auto-
scaling and fine-grained resource provisioning to exten-
Scalability sively mitigate resource wasting during distributed DRL
Fig. 7 illustrates M INIONS RL’s scalability using the same training. To accelerate training- and cost-efficiency, we de-
testbed. The total completion time of one-round DRL train- signed a DRL-driven scheduler to seek the optimal num-
ing increases as the number of actors increases. Training ber of actors by learning the fundamental trade-off between
time for Atari environments (i.e., GravitarNoFrameskip-v4, training performance and cost. We evaluated M INIONS RL
SpaceInvadersNoFrameskip-v4, and QbertNoFrameskip- on realistic clusters with popular tasks from OpenAI Gym.
v4) has a larger increase rate than Mujoco environments Experimental results show that M INIONS RL outperforms
(i.e., Hopper-v3, Humanoid-v3, HalfCheetah-v3), because state-of-the-art and state-of-the-practice solutions by reduc-
processing stacked frames brings significantly more compu- ing up to 52% total training time and 86% training cost.
Acknowledgements
The work of H. Yu and H. Wang was supported in part by the National Science Foundation (NSF) grants 2153502, 2315612, 2327480, and the AWS Cloud Credit for Research program. The work of J. Li was supported in part by the NSF grants 2148309 and 2315614, and the U.S. Army Research Office (ARO) grant W911NF-23-1-0072. The work of X. Yuan was supported in part by the NSF grants 2019511, 2348452, and 2315613. Results presented in this paper were obtained using CloudBank (Norman et al. 2021), supported by the NSF award 1925001. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References
Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained Policy Optimization. In International Conference on Machine Learning (ICML).
Ali, A.; Pinciroli, R.; Yan, F.; and Smirni, E. 2020. BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE.
Azure Container Instances. 2022. Azure Container Instances. https://azure.microsoft.com/en-us/products/container-instances/. [Online; accessed 1-Jan-2022].
Azure Machine Learning. 2022. Azure Machine Learning. https://azure.microsoft.com/en-us/products/machine-learning/. [Online; accessed 1-Jan-2022].
Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. 2019. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.
Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.
Carreira, J.; Fonseca, P.; Tumanov, A.; Zhang, A.; and Katz, R. 2019. Cirrus: A Serverless Framework for End-to-end ML Workflows. In Proceedings of the ACM Symposium on Cloud Computing (SoCC).
Chard, R.; Babuji, Y.; Li, Z.; Skluzacek, T.; Woodard, A.; Blaiszik, B.; Foster, I.; and Chard, K. 2020. FuncX: A Federated Function Serving Fabric for Science. In Proc. of the 29th International Symposium on High-performance Parallel and Distributed Computing (HPDC), 65–76.
Devarakonda, A.; Naumov, M.; and Garland, M. 2017. AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks. arXiv preprint arXiv:1712.02029.
Espeholt, L.; Marinier, R.; Stanczyk, P.; Wang, K.; and Michalski, M. 2020. SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference. In International Conference on Learning Representations (ICLR).
Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Doron, Y.; Firoiu, V.; Harley, T.; Dunning, I.; et al. 2018. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. In International Conference on Machine Learning (ICML).
Gu, S. S.; Lillicrap, T.; Turner, R. E.; Ghahramani, Z.; Schölkopf, B.; and Levine, S. 2017. Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning. Advances in Neural Information Processing Systems (NIPS).
Guo, R.; Guo, V.; Kim, A.; Hildred, J.; and Daudjee, K. 2022. Hydrozoa: Dynamic Hybrid-Parallel DNN Training on Serverless Containers. Proceedings of Machine Learning and Systems (MLSys).
Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; and Silver, D. 2018. Rainbow: Combining Improvements in Deep Reinforcement Learning. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI).
Horgan, D.; Quan, J.; Budden, D.; Barth-Maron, G.; Hessel, M.; Van Hasselt, H.; and Silver, D. 2018. Distributed Prioritized Experience Replay. arXiv preprint arXiv:1803.00933.
Ji, Y.; Li, Z.; Sun, Y.; Peng, X. B.; Levine, S.; Berseth, G.; and Sreenath, K. 2022. Hierarchical Reinforcement Learning for Precise Soccer Shooting Skills using a Quadrupedal Robot. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. 2021. Highly Accurate Protein Structure Prediction with AlphaFold. Nature.
Kapturowski, S.; Ostrovski, G.; Quan, J.; Munos, R.; and Dabney, W. 2018. Recurrent Experience Replay in Distributed Reinforcement Learning. In International Conference on Learning Representations (ICLR).
Kingma, D. P.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
Liang, E.; Liaw, R.; Nishihara, R.; Moritz, P.; Fox, R.; Goldberg, K.; Gonzalez, J.; Jordan, M.; and Stoica, I. 2018. RLlib: Abstractions for Distributed Reinforcement Learning. In International Conference on Machine Learning (ICML).
Liaw, R.; Liang, E.; Nishihara, R.; Moritz, P.; Gonzalez, J. E.; and Stoica, I. 2018. Tune: A Research Platform for Distributed Model Selection and Training. arXiv preprint arXiv:1807.05118.
Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous Control with Deep Reinforcement Learning. arXiv preprint arXiv:1509.02971.
Luo, M.; Yao, J.; Liaw, R.; Liang, E.; and Stoica, I. 2020. IMPACT: Importance Weighted Asynchronous Architectures with Clipped Target Networks. In International Conference on Learning Representations (ICLR).
Mao, H.; Schwarzkopf, M.; Venkatakrishnan, S. B.; Meng, Z.; and Alizadeh, M. 2019. Learning Scheduling Algorithms for Data Processing Clusters. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM).
Mao, W.; Qiu, H.; Wang, C.; Franke, H.; Kalbarczyk, Z.; Iyer, R.; and Basar, T. 2022. A Mean-field Game Approach to Cloud Resource Management with Function Approximation. Advances in Neural Information Processing Systems (NIPS).
McCandlish, S.; Kaplan, J.; Amodei, D.; and Team, O. D. 2018. An Empirical Model of Large-batch Training. arXiv preprint arXiv:1812.06162.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous Methods for Deep Reinforcement Learning. In International Conference on Machine Learning (ICML).
Moritz, P.; Nishihara, R.; Wang, S.; Tumanov, A.; Liaw, R.; Liang, E.; Elibol, M.; Yang, Z.; Paul, W.; Jordan, M. I.; et al. 2018. Ray: A Distributed Framework for Emerging AI Applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI).
Norman, M.; Kellen, V.; Smallen, S.; DeMeulle, B.; Strande, S.; Lazowska, E.; Alterman, N.; Fatland, R.; Stone, S.; Tan, A.; et al. 2021. CloudBank: Managed Services to Simplify Cloud Access for Computer Science Research and Education. In Practice and Experience in Advanced Research Computing (PEARC).
OpenAI. 2017. Proximal Policy Optimization. https://openai.com/blog/openai-baselines-ppo/. [Online; accessed 1-Jan-2022].
OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
Qiu, H.; Banerjee, S. S.; Jha, S.; Kalbarczyk, Z. T.; and Iyer, R. K. 2020. FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI).
Qiu, H.; Mao, W.; Patke, A.; Wang, C.; Franke, H.; Kalbarczyk, Z. T.; Başar, T.; and Iyer, R. K. 2022. SIMPPO: A Scalable and Incremental Online Learning Framework for Serverless Resource Management. In Proceedings of the 13th Symposium on Cloud Computing.
Qiu, H.; Mao, W.; Wang, C.; Franke, H.; Youssef, A.; Kalbarczyk, Z. T.; Basar, T.; and Iyer, R. K. 2023. AWARE: Automate Workload Autoscaling with Reinforcement Learning in Production Cloud Systems. In 2023 USENIX Annual Technical Conference (USENIX ATC).
Roy, R. B.; Patel, T.; Gadepally, V.; and Tiwari, D. 2022. Mashup: Making Serverless Computing Useful for HPC Workflows via Hybrid Execution. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 46–60.
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust Region Policy Optimization. In International Conference on Machine Learning (ICML).
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature.
Thorpe, J.; Qiao, Y.; Eyolfson, J.; Teng, S.; Hu, G.; Jia, Z.; Wei, J.; Vora, K.; Netravali, R.; Kim, M.; et al. 2021. Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads. In USENIX Symposium on Operating Systems Design and Implementation (OSDI).
Thumm, J.; and Althoff, M. 2022. Provably Safe Deep Reinforcement Learning for Robotic Manipulation in Human Environments. In 2022 International Conference on Robotics and Automation (ICRA).
Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. 2019. Grandmaster Level in StarCraft II Using Multi-agent Reinforcement Learning. Nature.
Wang, H.; Niu, D.; and Li, B. 2019. Distributed Machine Learning with a Serverless Architecture. In IEEE 2019 Conference on Computer Communications (INFOCOM).
Wijmans, E.; Kadian, A.; Morcos, A.; Lee, S.; Essa, I.; Parikh, D.; Savva, M.; and Batra, D. 2019. DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames. arXiv preprint arXiv:1911.00357.
Yu, H.; Wang, H.; Li, J.; Yuan, X.; and Park, S.-J. 2022. Accelerating Serverless Computing by Harvesting Idle Resources. In Proceedings of the ACM Web Conference 2022, 1741–1751.
Yu, M.; Jiang, Z.; Ng, H. C.; Wang, W.; Chen, R.; and Li, B. 2021. Gillis: Serving Large Neural Networks in Serverless Functions with Automatic Model Partitioning. In 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS). IEEE.
