Inference Serving System For Stable Diffusion As A Service
Abstract—We present a model-less, privacy-preserving, low-latency inference framework to satisfy user-defined System-Level Objectives (SLO) for Stable Diffusion as a Service (SDaaS). Developers of Stable Diffusion (SD) models register their trained models on our proposed system through a declarative API. Users, on the other hand, can specify SLOs in terms of the style of the generated image for their input text, the requested processing latency, and the minimum requested text-to-image similarity (CLIP score) for inference through the user API. Assuming black-box access to the registered models, we profile them on hardware accelerators to design an inference predictor module. It heuristically predicts the required number of inference steps for the user-requested text-to-image CLIP score and the requested latency, for a specific SD model over a hardware accelerator, to satisfy the SLO. In combination with the inference predictor module, we propose a shortest-job-first algorithm for our inference framework. Compared to traditional Deep Neural Network (DNN) and Large Language Model (LLM) inference scheduling algorithms, our proposed method outperforms them in average job completion time and in the average number of SLOs satisfied in a user-defined SLO scenario.

Index Terms—Inference Serving System, Stable Diffusion

I. INTRODUCTION

To serve traditional DNN models in the cloud, the trained models are deployed on CPUs [1], [2], and a single forward pass is initiated for every input query to generate a classification label. The optimizations are thereby focused on model switching [1], ease-of-use [3], and lower inference latency [2], to highlight a few. In contrast, an SD model first maps the input text to token embeddings as a representation of the input text; starting from a random noisy latent image array, the diffusion process then refines this latent array until the image decoder can decode the final image from it. This refinement happens in a step-by-step fashion, with each diffusion step adding more relevant information to the latent array. With generative Artificial Intelligence (AI) models, particularly SD for text-to-image generation, being progressively deployed in the cloud [4], [5], this striking difference in the inference process compared to DNNs motivates us to design a model-less, privacy-preserving, low-latency inference framework for SDaaS.

Developers are incrementally advancing state-of-the-art SD models for text-to-image conversion. Across a variety of pre-trained SD models, each variant exhibits different resource footprints and processing latencies on heterogeneous compute resources. In this paper, we present a model-less, privacy-preserving, low-latency inference framework to satisfy user-defined SLOs for SDaaS. Developers of SD models can register their models on our proposed system through a declarative API. Users, on the other hand, can specify SLOs in terms of the style of the generated image, the requested latency, and the minimum requested CLIP score for inference through the user API. Our proposed system manages model registration from the developers, and schedules volumes of user queries to meet their SLOs through an efficient deployment of the models onto hardware accelerators in the compute cluster.

The rest of the paper is organized as follows. Section II highlights the key findings that guide our inference framework development, followed by our proposed system design in Section III. Section IV delves into our evaluation results, and Section V concludes the paper.

II. MOTIVATION

We evaluate the runtime latencies for a specific prompt across 9 pre-trained SD models (from https://huggingface.co/) for text-to-image generation. We specifically investigate the performance discrepancy between CPU and GPU for a single inference step in the diffusion process. The aggregated findings for a randomly selected prompt, ‘an astronaut riding a cow’, are presented in Table I, utilizing a floating-point precision of 16 consistently across all SD models. Our testbed uses a 64-bit x86_64 Intel(R) Xeon(R) CPU @ 2.20 GHz as the CPU and a Tesla V100 as the GPU. Our analysis reveals that, across all evaluated SD models, the average inference time on the CPU is 92x higher than on a GPU deployment. (1) This highlights the need for scheduling SD inferencing workloads on GPUs to achieve significantly lower latency.

From the data presented in Table I, we also observe variability in inference times when utilizing the same compute resource across a variety of SD models for an identical query. (2) This elucidates that every SD model exhibits significant variations in its resource requirements.

Next, we subject one SD model to the identical inference workload across various hardware accelerators. In Table II, we present the average inference time for 10 inference steps for the SD model ‘dreamlike-art/dreamlike-anime-1.0’ across these accelerators.
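A minimal sketch of such a per-model timing measurement, assuming the Hugging Face diffusers library, is shown below; the model identifier, step count, and precision handling on CPU are illustrative rather than our exact profiling script.

    import time
    import torch
    from diffusers import StableDiffusionPipeline

    PROMPT = "an astronaut riding a cow"

    def time_inference(model_id: str, device: str, steps: int = 1) -> float:
        """Time `steps` diffusion steps of one SD model on the given device."""
        # FP16 on the GPU; FP32 on the CPU for portability (fp16 kernels
        # are not always available on CPU backends).
        dtype = torch.float16 if device == "cuda" else torch.float32
        pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
        pipe = pipe.to(device)
        start = time.perf_counter()
        pipe(PROMPT, num_inference_steps=steps)
        if device == "cuda":
            torch.cuda.synchronize()
        return time.perf_counter() - start

    # Example comparison for one registered model:
    # cpu_t = time_inference("dreamlike-art/dreamlike-anime-1.0", "cpu")
    # gpu_t = time_inference("dreamlike-art/dreamlike-anime-1.0", "cuda")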
TABLE II: Average inference latency for 10 inference steps of the SD model ‘dreamlike-art/dreamlike-anime-1.0’ across hardware accelerators.

GPU           Inference Time (s)
RTX 4090      5.473
RTX 3090      5.952
A100 PCIe     3.684
RTX 4080      7.417
RTX A6000     5.591
RTX A4000     9.601
Tesla V100    6.804
RTX 3060      13.997
A40           6.214
RTX 3080      7.211
Fig. 2: Processing latency for SD models over (a) Tesla V100, (b) A100, and (c) RTX 4080 hardware accelerators; (d) the CLIP score (‘ViT-B/32’) of the SD models as a function of inference steps.
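A minimal sketch of this text-to-image similarity metric, assuming the open-source CLIP package (https://github.com/openai/CLIP), is given below; it illustrates the score as the cosine similarity between prompt and image embeddings rather than our exact evaluation code.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def clip_score(image: Image.Image, prompt: str) -> float:
        """Cosine similarity between CLIP embeddings of an image and its prompt."""
        img = preprocess(image).unsqueeze(0).to(device)
        txt = clip.tokenize([prompt]).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(img)
            txt_feat = model.encode_text(txt)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ txt_feat.T).item()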
TABLE III: Impact of the number of inference steps on inference latency and CLIP score for half-precision and single-precision stable diffusion models.

Inference   Half-precision (16FP)       Single-precision (32FP)
Steps       Latency (s)   CLIP Score    Latency (s)   CLIP Score
1           0.123         0.163         0.283         0.165
2           0.195         0.168         0.386         0.175
3           0.241         0.341         0.493         0.348
4           0.293         0.296         0.604         0.309
5           0.339         0.269         0.704         0.295
6           0.376         0.327         0.805         0.334
7           0.411         0.322         0.892         0.329
8           0.447         0.314         0.985         0.324

Fig. 3: Overview of our proposed system architecture.

III. SYSTEM DESIGN

• Controller The controller serves inference queries from the users. It comprises two modules: (a) the scheduler logic block, and (b) the model registrar. The scheduler logic block consists of the inference time predictor and the scheduling algorithm. Assuming black-box access to the models to preserve developer privacy, all registered models are profiled on the worker nodes (hardware accelerators) over multiple text-to-image prompts for varying inference steps. Based on the profiling results, the inference time predictor heuristically predicts the number of inference steps, for a user-defined SLA_CLIP and SLA_latency, for a specified SD model over a hardware accelerator, to satisfy the SLO for the specific query. The other module, the model registrar, is responsible for managing model registration from the developers to the model repository through an API.

• Worker Worker nodes execute inference queries, following the scheduler logic, as directed by the controller, over the hardware accelerators. Hardware-specific execution daemons manage the deployment and execution of models.

• Model Repository The model repository functions as high-capacity, persistent storage containing the trained SD models registered by the developers. The worker nodes can access this storage to execute SD models on hardware accelerators based on the user inference query.

• Metadata Store Assuming black-box access to the models to preserve developer privacy, all registered models are profiled on the worker nodes over multiple text-to-image prompts for varying inference steps. The metadata store includes information concerning the available models and the profiled mapping from the number of inference steps to mean CLIP score and mean inference latency. The inference time predictor accesses this data to heuristically predict the number of inference steps, for the SLA_CLIP and SLA_latency of a specified model on a hardware accelerator, to satisfy the SLO for the specific query (a minimal sketch of this lookup follows the list).
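A minimal sketch of this lookup is given below, assuming a `profile` mapping built from the metadata store; the helper name `predict_steps` and the data layout are illustrative, not the system's actual API. The heuristic simply selects the smallest profiled step count whose mean latency and mean CLIP score satisfy the query's SLA on the given (model, accelerator) pair.

    from typing import Dict, Optional, Tuple

    # profile[(model_id, accelerator)][steps] = (mean_latency_s, mean_clip_score)
    Profile = Dict[Tuple[str, str], Dict[int, Tuple[float, float]]]

    def predict_steps(profile: Profile, model_id: str, accelerator: str,
                      sla_latency: float, sla_clip: float) -> Optional[int]:
        """Smallest profiled step count meeting both SLOs, or None if infeasible."""
        table = profile.get((model_id, accelerator), {})
        feasible = [steps for steps, (latency, clip_s) in table.items()
                    if latency <= sla_latency and clip_s >= sla_clip]
        return min(feasible) if feasible else None

    # Example user query (fields mirror the user API: prompt, style, SLA_CLIP, SLA_latency):
    query = {"prompt": "an astronaut riding a cow", "style": "anime",
             "sla_clip": 0.30, "sla_latency": 10.0}
    # steps = predict_steps(profile, "dreamlike-art/dreamlike-anime-1.0",
    #                       "Tesla V100", query["sla_latency"], query["sla_clip"])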
IV. EVALUATION

We evaluated our inferencing framework over 3 SD models, namely ‘DGSpitzer/Cyberpunk-Anime-Diffusion’, ‘runwayml/stable-diffusion-v1-5’, and ‘hakurei/waifu-diffusion’, corresponding to the styles (s) anime, regular diffusion, and waifu, respectively.
To simulate the text prompts tp in user queries, we randomly sampled 100 prompts from Flickr8k (35 tp with s: regular diffusion, 33 tp with s: anime, and 32 tp with s: waifu). The requested CLIP score (SLA_CLIP) is sampled from a normal distribution with µ: 0.3, σ: 0.02, while the requested latency (SLA_latency) is sampled from a normal distribution with µ: 10, σ: 2, in the user query. Due to a lack of publicly available traces for generative AI workloads, we opted for an increasing workload pattern [7] for the job arrival rate. We set a job arrival rate of 0.8 jobs/second, increasing linearly to 0.85 jobs/second over 100 seconds. All experiments were performed over NVIDIA T4 and NVIDIA A10G GPUs, at a fixed floating-point precision of 16 for the SD models. The CLIP score was estimated with a pre-trained ‘ViT-B/32’ model.
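A minimal sketch of this query generation is given below, assuming exponentially distributed inter-arrival times at the instantaneous rate (the arrival process beyond the linearly increasing rate is not specified here) and a uniform style assignment rather than the fixed 35/33/32 split; names such as `make_jobs` are illustrative.

    import random

    def make_jobs(prompts, styles, duration_s=100.0, rate_start=0.8, rate_end=0.85):
        """Sample user queries with SLA_CLIP ~ N(0.3, 0.02), SLA_latency ~ N(10, 2),
        and an arrival rate rising linearly from rate_start to rate_end jobs/s."""
        jobs, t = [], 0.0
        while True:
            rate = rate_start + (rate_end - rate_start) * (t / duration_s)
            t += random.expovariate(rate)  # assumed Poisson arrivals at the current rate
            if t >= duration_s:
                break
            jobs.append({
                "arrival_s": t,
                "prompt": random.choice(prompts),
                "style": random.choice(styles),  # simplification of the per-style prompt split
                "sla_clip": random.gauss(0.3, 0.02),
                "sla_latency": random.gauss(10.0, 2.0),
            })
        return jobs

    # jobs = make_jobs(flickr8k_prompts, ["regular diffusion", "anime", "waifu"])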
We compare First-In First-Out (FIFO), Multi-Level Feedback Queue (MLFQ), Linear Programming (CP-SAT) used to schedule traditional DNNs (in our case, we utilize the inference predictor to predict the number of inference steps for processing latency), and our proposed Shortest Job First (SJF) algorithm in combination with the inference time predictor.
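A minimal sketch of this policy is given below, reusing the illustrative `predict_steps` helper and `profile` layout from Section III and assuming each queued job already carries the SD model resolved from its requested style.

    def schedule_sjf(queue, profile, accelerator):
        """Order pending jobs by predicted completion time on this accelerator;
        jobs with no feasible step count are skipped (e.g., deferred elsewhere)."""
        ranked = []
        for job in queue:
            steps = predict_steps(profile, job["model_id"], accelerator,
                                  job["sla_latency"], job["sla_clip"])
            if steps is None:
                continue
            predicted_latency, _ = profile[(job["model_id"], accelerator)][steps]
            ranked.append((predicted_latency, steps, job))
        ranked.sort(key=lambda item: item[0])  # shortest predicted job first
        return [(job, steps) for _, steps, job in ranked]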
We first evaluate a scenario wherein the requested job queries are characterized by tp, SLO_CLIP, and s. We scale the job arrival rate to compute the average job completion time. Our results, as presented in Fig. 4, indicate the highest system throughput for our proposed inference time predictor module in combination with SJF. Next, we evaluate a scenario wherein the job queries are characterized by tp, SLO_CLIP, SLO_latency, and s. On scaling the job arrival rate and SLO_latency, our results indicate that our proposed inference time predictor in combination with SJF outperforms the baselines, as presented in Fig. 5 and Fig. 6, respectively.

Fig. 4: Variation in job arrival rate to average job completion time. Our proposed inference time predictor in combination with SJF has the highest throughput.

Fig. 5: Variation in job arrival rate to the number of SLOs met (CPSAT, SJF, FIFO, MLFQ).

Fig. 6: Variation in SLO_latency to the number of SLOs met. Our proposed inference time predictor in combination with SJF satisfies the most SLOs.

V. CONCLUSION
Our proposed model-less, privacy-preserving, low-latency inferencing framework for SDaaS outperforms the baseline inference scheduling approaches. It would be essential to conduct evaluations on an expanded cloud trace, incorporating additional model variants and alternative job arrival patterns [7]. The inference time predictor can be dynamically updated to adapt to SLO needs.

REFERENCES