With Serverless Computing
…
system scheduling. Distributed algorithms and architectures have been vastly proposed (e.g., the actor-learner architecture) to accelerate DRL training with large-scale server-based clusters. However, training on-policy algorithms with the actor-learner architecture unavoidably induces resource wasting due to the synchronization between learners and actors, thus resulting in significant extra billing. As a promising alternative, serverless computing naturally fits on-policy synchronization and alleviates resource wasting in distributed DRL training with pay-as-you-go pricing. Yet, none has leveraged serverless computing to facilitate DRL training. This paper proposes MinionsRL, the first serverless distributed DRL training framework, which aims to accelerate DRL training and improve cost-efficiency with dynamic actor scaling. We prototype MinionsRL on top of Microsoft Azure Container Instances and evaluate it with popular DRL tasks from OpenAI Gym. Extensive experiments show that MinionsRL reduces total training time by up to 52% and training cost by up to 86% compared to the latest solutions.

Figure 1: Server-based vs. serverless architectures. (a) One-round on-policy training; (b) server-based training; (c) serverless training.

Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction

The success of AlphaGo (Silver et al. 2016) inspires various deep reinforcement learning (DRL) applications, such as gaming AI (Vinyals et al. 2019; Berner et al. 2019), robotics (Ji et al. 2022; Thumm and Althoff 2022), system scheduling (Mao et al. 2022; Qiu et al. 2023), bioinformatics (Jumper et al. 2021), and large language model training (OpenAI 2023). DRL training is expensive: it takes numerous trials and errors and consumes substantial computing resources and time. Thus, a number of distributed DRL algorithms have been proposed to parallelize and accelerate training with multiple servers (Luo et al. 2020; Wijmans et al. 2019; Espeholt et al. 2018; Horgan et al. 2018; Kapturowski et al. 2018; Hessel et al. 2018; Espeholt et al. 2020).

The actor-learner architecture represents one of the most efficient distributed DRL training paradigms available (Luo et al. 2020; Espeholt et al. 2018, 2020). This approach decouples the DRL agent's responsibilities into two distinct roles: actors for data sampling and learners for policy updates. On-policy algorithms (Schulman et al. 2017; Achiam et al. 2017; Wijmans et al. 2019) have emerged as a prominent DRL algorithm family, fully leveraging the actor-learner training architecture with distributed computing clusters (Gu et al. 2017).

To facilitate efficient learning in a distributed environment with consistent DRL policies, on-policy algorithms enforce a synchronization process between learners and actors after every training round, as Fig. 1(a) shows. Due to stochastic environment dynamics (e.g., game environments), some actors may have episodes that end sooner, leading them to finish rounds earlier and wait idle for the other actors. Additionally, all actors remain idle during the policy update by the learner, as Fig. 1(b) shows. These idle actors waste significant computing resources, amplifying training costs on server-based clusters.

The dilemma of server-based DRL training. State-of-the-art server-based approaches reserve a fixed number of workers (e.g., physical or cloud servers) for distributed DRL training. These methods face two primary challenges: 1) their coarse-grained resource management (e.g., server-level instead of CPU core-level) leaves idle actors' resources unreleased; and 2) the prolonged server startup process (minute-level) prevents efficient mitigation of DRL actors' idle time through frequent server toggling. Thus, we propose to enable cheaper and faster distributed DRL with serverless computing.
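As a minimal illustration of the round structure in Fig. 1(a) and (b), the sketch below spells out one synchronous on-policy round in Python. The helpers run_actor and update_policy are hypothetical placeholders rather than any framework's API; the point is only that the round is gated by the slowest actor and that every actor sits idle while the learner updates the policy.

# Minimal sketch of one synchronous on-policy round (server-based view).
# `run_actor` and `update_policy` are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
import time

def one_round(policy, actor_env_ids, run_actor, update_policy):
    start = time.time()
    with ThreadPoolExecutor(max_workers=len(actor_env_ids)) as pool:
        # Every actor rolls out the same policy; the round cannot end
        # until the slowest actor returns its trajectories.
        futures = [pool.submit(run_actor, policy, env_id) for env_id in actor_env_ids]
        trajectories = [f.result() for f in futures]
    sampling_time = time.time() - start

    # While the learner updates the policy, all reserved actor workers idle,
    # yet a server-based cluster keeps billing for them.
    new_policy = update_policy(policy, trajectories)
    round_time = time.time() - start
    return new_policy, sampling_time, round_time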
Serverless computing and how it fits distributed DRL? Serverless computing, also known as Function-as-a-Service (FaaS), is a new cloud computing model that uses lightweight containers as execution units. Unlike physical clusters and traditional cloud computing, which require tedious configuration, serverless computing packages and executes tasks (e.g., DRL actors and learners) as functions with instant toggling (i.e., sub-second level) and auto-scaling. Thus, serverless computing has been widely deployed to serve computation-intensive applications, such as deep learning (Ali et al. 2020; Carreira et al. 2019; Wang, Niu, and Li 2019; Yu et al. 2021, 2022) and scientific computing (Chard et al. 2020; Roy et al. 2022). Fig. 1(c) shows how serverless functions naturally accommodate the on-policy training process with on-and-off DRL actors and learners, which mitigates idle resources.

Leveraging serverless computing's fine-grained resource provisioning and instant execution, a fundamental question arises: how can we achieve faster and cheaper DRL training with an appropriate number of concurrent actors in each round? To answer this question, we propose MinionsRL, the first serverless DRL training framework, which dynamically adjusts the number of actors according to the DRL training progress. As training proceeds, it takes varying volumes of training data to advance neural network model quality in each round (Devarakonda, Naumov, and Garland 2017; McCandlish et al. 2018). In the actor-learner architecture, the number of actors in each round determines the volume of sampled training data, thus impacting the policy network quality. This intuition leads us to design an intelligent scheduler that learns to perform dynamic actor scaling for each training round to optimize the DRL policy quality with minimal training time and costs. Our main contributions are as follows:

• We propose MinionsRL, the first distributed DRL training framework based on serverless computing.
• We design an intelligent scheduler that learns to scale out actors dynamically and accelerate distributed DRL training with minimal costs.
• We evaluated MinionsRL on an off-the-shelf serverless testbed (i.e., Microsoft Azure). Experiments with OpenAI Gym show that MinionsRL reduces total training time by up to 52% and costs by up to 86%.

Figure 2: Adjusting the number of actors when training OpenAI Gym CartPole-v1 (Brockman et al. 2016) with Proximal Policy Optimization (PPO) (OpenAI 2017). (a) Actor scheduling (Fixed, Decrease, Increase); (b) final rewards.

Preliminaries

Actor-learner architecture
The actor-learner architecture is one of the most performant and efficient approaches that attempt to scale and accelerate DRL training. A3C (Mnih et al. 2016) first introduced a simple actor-learner prototype. IMPALA (Espeholt et al. 2018) proposed a standard actor-learner architecture with V-trace correction for off-policy training. IMPACT (Luo et al. 2020) added a surrogate target network to the actor-learner architecture to stabilize training performance. SEED RL (Espeholt et al. 2020) aimed to accelerate the actor-learner architecture by centralizing actor inference on GPUs.

Server-based vs. Serverless DRL Training
Server-based training platforms provide users with an entire server whose coarse-grained resources are packed together. For example, the cheapest Azure cloud server equipped with a V100 GPU is Standard NC6s v3, bundled with 6 CPU cores and 112 GB of memory. Instead, serverless computing executes tasks with lightweight containers, thus allowing fine-grained resource provisioning with instant function launch/release, and charges users only for the resources (e.g., CPU/GPU and memory) consumed during actual execution (e.g., per second). Due to these unique features, serverless computing is particularly appealing for tasks that require elasticity and high concurrency, such as scientific computing (Chard et al. 2020; Roy et al. 2022) and distributed training (Wang, Niu, and Li 2019; Guo et al. 2022; Thorpe et al. 2021; Yu et al. 2021, 2022).

Motivating Dynamic Actor Scaling for DRL
One of the fundamental differences between DRL and supervised learning is the training data. In supervised learning tasks, training data is collected offline before training starts, whereas DRL tasks sample training data online during the rollout of the current policy with actors. As training proceeds, neural networks tend to demand varying volumes of training data in each round (Devarakonda, Naumov, and Garland 2017; McCandlish et al. 2018). Hence, the number of DRL actors dictates the amount of training data sampled in each round, potentially influencing the efficiency of DRL training and the quality of the policy.

Fig. 2 uses a real-world experiment to show the potential impact on policy quality of adjusting the number of actors during DRL training. Fig. 2(a) shows the three actor scaling strategies: 1) Fixed, which uses a fixed number of actors; 2) Decrease, which decreases ten actors every ten training rounds; and 3) Increase, which increases ten actors every ten rounds. Note that the three strategies operate under the same actor budget (i.e., the cumulative number of actors used is the same). Fig. 2(b) shows the different final rewards achieved by the three actor scaling strategies, raising a fundamental question: given the flexibility and scalability of serverless computing, how can we dynamically scale out actors for faster and cheaper DRL training?
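To make the three schedules in Fig. 2(a) concrete, the short sketch below enumerates them under a shared actor budget. The specific numbers (40 rounds, 10 to 40 actors per decade, a fixed count of 25) are illustrative assumptions only; the text above specifies just the plus-or-minus-ten-every-ten-rounds pattern and the equal cumulative budget.

# Hypothetical illustration of the three actor-scaling schedules in Fig. 2(a).
# Concrete numbers are assumptions for illustration, not values from the paper.
ROUNDS = 40

def fixed(round_idx: int) -> int:
    return 25                            # constant number of actors

def decrease(round_idx: int) -> int:
    return 40 - 10 * (round_idx // 10)   # 40, 30, 20, 10 actors per decade

def increase(round_idx: int) -> int:
    return 10 + 10 * (round_idx // 10)   # 10, 20, 30, 40 actors per decade

for schedule in (fixed, decrease, increase):
    budget = sum(schedule(k) for k in range(ROUNDS))
    print(schedule.__name__, budget)     # all three sum to 1000 actor-rounds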
MinionsRL's Design

Figure 3: MinionsRL's architecture.

MinionsRL refactors the actor-learner DRL architecture into independent … management. MinionsRL aims to instantly launch the necessary number of actors for faster DRL training, while promptly releasing idle actor and learner functions to optimize cost-efficiency, so that MinionsRL can dynamically scale the number of actors in real time as needed. Specifically, MinionsRL aims to address two primary challenges:
Incorporate characteristics of DRL tasks. DRL training differs significantly from other ML training, for example, in its recurrent interaction and online data sampling. To achieve high performance and low cost, it is necessary to make MinionsRL's scheduling aware of DRL workload characteristics. However, existing machine learning schedulers are not designed for distributed DRL training (Guo et al. 2022; Wang, Niu, and Li 2019; Carreira et al. 2019), so their techniques are not directly applicable.
Solution: The training process of MinionsRL is designed to be DRL objective- and constraint-aware. To capture the unique characteristics of DRL workloads, we embed critical features into the states of MinionsRL's agent, such as the average final rewards and the Kullback–Leibler (KL) divergence. The reward function of MinionsRL's agent is also crafted with awareness of the momentary budget and workload actor performance, guiding MinionsRL to search for optimal scheduling decisions through training.

Trade-off between training performance and cost. It is ambiguous how many actors should be launched in each round to hit a sweet spot between training performance and cost. Moreover, it is difficult to infer the complicated dependency between actor scheduling and …

… based on the sampled data. At the end of round $k$, the cumulative reward achieved by the agent is represented as $J_k$. Let $P_{k,i}^a$ and $P_k^l$ denote the execution time of the $i$-th actor and the learner function in round $k$, where each actor and learner function is allocated $d^a$ and $d^l$ resources, respectively. We use $c$ to represent the unit price of executing a function with a unit resource for one second. Thus, the duration $P_k$ and cost $C_k$ of round $k$ in on-policy training are given by

$P_k := P_k^l + \max_i \{P_{k,i}^a\}$,   (1)

$C_k := c \big( P_k^l d^l + \sum_{i=1}^{I_k} P_{k,i}^a d^a \big)$.   (2)

The goal is to minimize the training duration $\sum_{k=1}^{K} P_k$ via Eq. 1 while the cost $\sum_{k=1}^{K} C_k$ via Eq. 2 is subject to a monetary budget $B$, by deciding $I_k$ in each round:

$\min_{I_k} \sum_{k=1}^{K} \big( P_k^l + \max_i \{P_{k,i}^a\} \big)$,   (3)
…
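As a quick sanity check of Eqs. 1 and 2, the minimal Python sketch below computes one round's duration and cost from per-function execution times. It assumes the reading of Eq. 2 given above, in which the unit price c multiplies both the learner and actor terms, and all numbers are placeholders rather than measured values.

from typing import Sequence

def round_duration(p_learner: float, p_actors: Sequence[float]) -> float:
    """Eq. 1: P_k = P_k^l + max_i P_{k,i}^a (the learner runs after the slowest actor)."""
    return p_learner + max(p_actors)

def round_cost(c: float, p_learner: float, d_learner: float,
               p_actors: Sequence[float], d_actor: float) -> float:
    """Eq. 2: C_k = c * (P_k^l d^l + sum_i P_{k,i}^a d^a), pay-per-use billing."""
    return c * (p_learner * d_learner + sum(p * d_actor for p in p_actors))

# Example: 3 actor functions with 1 resource unit each, a learner with 4 units,
# and a unit price c; all values are placeholders.
print(round_duration(2.0, [5.0, 6.5, 4.2]))                  # -> 8.5
print(round_cost(0.0001, 2.0, 4.0, [5.0, 6.5, 4.2], 1.0))    # -> 0.00237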
Figure 4: MinionsRL outperforms baselines on statistical and time efficiency for continuous and discrete control tasks. (Panels include (a) Hopper-v3, (b) Humanoid-v3, (d) GravitarNoFrameskip-v4, and (e) SpaceInvadersNoFrameskip-v4; each plots final rewards against the number of rounds and wall clock time (s) for MinionsRL, Azure ML, and IMPACT.)
Figure 5: MinionsRL's actor scheduling decisions on two tasks, (a) Hopper-v3 and (b) QbertNoFrameskip-v4 (number of actors per round, with phases A, B, and C marked). MinionsRL dynamically schedules actors to balance training performance and cost.

Figure 6: Ablation study of MinionsRL with its two variants, MinionsRL-Adapt and MinionsRL-Max, on (a) Hopper-v3 and (b) QbertNoFrameskip-v4 (final rewards against the number of rounds and wall clock time (s)).
the same budget. The performance variation is subtle at the beginning and gradually increases as training proceeds. The variation drops in the final rounds because some of the five experiments have ended earlier (either reaching the desired rewards or running out of budget). Thus, only one or two experiments proceed to further rounds/timestamps, leaving less variation, and zero variation at the end if only one experiment remains. The results show that MinionsRL is more efficient in transforming the monetary budget into training time. Under the same budget, MinionsRL trains much faster than Azure ML and IMPACT in statistical efficiency and wall clock time with similar or better performance.

Training cost. Table 2 reports the total training time and costs when the baselines reached the same final rewards. Compared to Azure ML and IMPACT, MinionsRL reduces training time and costs by up to 52% and 86%, respectively.
Actor Scheduling

We record and report how MinionsRL makes actor scheduling decisions to investigate the rationale behind the performance gain over the baselines. Fig. 5 depicts the number of actors MinionsRL schedules and the final rewards per round on Hopper-v3 and QbertNoFrameskip-v4, respectively. We use A, B, and C for convenience when referring to the three phases of decisions made by MinionsRL in Fig. 5. For Hopper-v3, MinionsRL launches more actors at the beginning of Phase A to boost training and gradually decreases the number of actors to save cost when performance steadies in Phases B and C. More actors are launched by MinionsRL at the end of Phase C to explore optimal performance. We observe similar results on QbertNoFrameskip-v4, where MinionsRL boosts training with more actors in Phases A and B, and reduces actors in the steady Phase C to save cost.

In contrast to the two baselines (i.e., Azure ML and IMPACT), which launch a fixed number of actors in every round, MinionsRL dynamically schedules actors throughout the training process to strike a balance between training performance and cost, thus completing training tasks cheaper and faster.

Ablation Study

To verify the effectiveness of its two key components, serverless functions and the DRL-based scheduler, we compare MinionsRL with two variants of itself: 1) MinionsRL-Max, which statically launches all 32 actors in every training round, and 2) MinionsRL-Adapt, which schedules actors with a naive, reward-ratio-based scheduler. Let $J$ be the target final reward and $I_{\max}$ be the maximum number of available actors per round. Let $\hat{J}_k$ denote the approximated final reward that the learner policy can achieve at round $k$, computed with a moving-window average over the last $n$ rounds, given by $\hat{J}_k := \sum_{x=k-n-1}^{k-1} J_x$. MinionsRL-Adapt then schedules a set of $I_k$ actor functions proportional to the ratio of $\hat{J}_k$ and $J$, given by $I_k := \mathrm{clip}(1, \frac{\hat{J}_k}{J} I_{\max}, I_{\max})$. This naive scheduler follows the intuition that a better policy may produce better data, so we proportionally allocate more actors when the policy quality is higher. We set the moving window size $n = 5$ in the evaluation.
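A minimal sketch of the MinionsRL-Adapt heuristic described above follows; it assumes the moving window is taken as the mean of the last n recorded final rewards and that clip bounds the result to the range [1, I_max]. The function names are illustrative, not MinionsRL's actual implementation.

def approx_final_reward(past_rewards: list[float], n: int = 5) -> float:
    # Moving-window estimate J_hat_k over the last n recorded rewards
    # (taken here as their mean).
    window = past_rewards[-n:] if past_rewards else [0.0]
    return sum(window) / len(window)

def adapt_num_actors(past_rewards: list[float], target_reward: float,
                     i_max: int, n: int = 5) -> int:
    # MinionsRL-Adapt: I_k = clip(1, (J_hat_k / J) * I_max, I_max).
    j_hat = approx_final_reward(past_rewards, n)
    proposal = (j_hat / target_reward) * i_max
    return int(min(max(1, proposal), i_max))

# Example: a weak early policy gets few actors, a near-target policy gets many.
print(adapt_num_actors([10, 20, 30], target_reward=400, i_max=32))      # -> 1
print(adapt_num_actors([350, 380, 390], target_reward=400, i_max=32))   # -> 29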
Final rewards. Fig. 6 shows the final rewards averaged over five repeated experiments for Hopper-v3 and QbertNoFrameskip-v4, respectively. Comparing MinionsRL with MinionsRL-Max, we observe that MinionsRL's DRL-based scheduler preserves similar or better training efficiency while saving actor costs. Note that MinionsRL-Max also runs the same DRL tasks with serverless functions. Comparing MinionsRL with MinionsRL-Adapt, the results demonstrate that MinionsRL's DRL-guided scheduler makes better actor-scheduling decisions than the naive ratio-based scheduler.

Training cost. Table 2 shows the total training time and costs of MinionsRL and the two variants when reaching the same final rewards. Compared to MinionsRL-Max, MinionsRL significantly reduces training cost by up to 44% while completing training within a similar duration.
… computation load to the learner.

Figure 7: Scalability of MinionsRL with respect to the number of actors in six environments (Hopper, HalfCheetah, Humanoid, SpaceInvaders, Gravitar, Qbert); the y-axis reports round completion time (s).

Breakdown

Latency breakdown. Fig. 8(a) and (b) characterize the latency breakdown of the interaction between actor and learner functions in MinionsRL's one-round training. Launching an actor and a learner function takes around 300 and 1500 ms, respectively (attaching GPUs to the learner container takes more time). We further eliminate the startup overhead by function pre-warming.

Figure 8: Latency breakdown of the interaction between actor and learner functions in MinionsRL's one-round training. (a) Continuous environments; (b) discrete environments.

Communication overheads. MinionsRL uses the efficient gRPC library to enable lightweight communication between actor and learner functions. Fig. 8(a) and (b) show the communication overhead between actor and learner functions. For (continuous) MuJoCo environments, transferring 65,536 timesteps between actor and learner functions incurs less than …

Scheduler Training Overhead Mitigation

MinionsRL trains the scheduler for each DRL task, which may lead to high overheads. For example, training a scheduler for Humanoid-v3/SpaceInvadersNoFrameskip-v4 from scratch took around 10/50 hours. We further investigate mitigating such overheads by fine-tuning a trained scheduler of one task to other tasks. Fine-tuning MinionsRL is feasible since different DRL tasks have the …