Inference Serving System For Stable Diffusion As A Service
Abstract—We present a model-less, privacy-preserving, low-latency inference framework to satisfy user-defined System-Level Objectives (SLO) for Stable Diffusion as a Service (SDaaS). Developers of Stable Diffusion (SD) models register their trained models on our proposed system through a declarative API. Users, on the other hand, can specify SLOs in terms of the style of the generated image for their input text, the requested processing latency, and the minimum requested text-to-image similarity (CLIP score) for inference through the user API. Assuming black-box access to the registered models, we profile them on hardware accelerators to design an inference predictor module. It heuristically predicts the required number of inference steps for the user-requested text-to-image CLIP score and the requested latency, for a specific SD model over a hardware accelerator, to satisfy the SLO. In combination with the inference predictor module, we propose a shortest-job-first algorithm for our inference framework. Compared to traditional Deep Neural Network (DNN) and Large Language Model (LLM) inference scheduling algorithms, our proposed method outperforms them in average job completion time and in the average number of SLOs satisfied in a user-defined SLO scenario.

Index Terms—Inference Serving System, Stable Diffusion

I. INTRODUCTION

To serve traditional DNN models in the cloud, the trained models are deployed on CPUs [1], [2], and a single forward pass is initiated for every input query to generate a classification label. The optimizations are thereby focused on model switching [1], ease-of-use [3], and lower inference latency [2], to highlight a few. In contrast, an SD model first maps the input text to token embeddings as a representation of the input text; starting from a random noisy latent image array, the diffusion process then refines this latent array until the image decoder can decode the final image from it. This refinement happens in a step-by-step fashion, with each diffusion step adding more relevant information to the latent array. With generative Artificial Intelligence (AI) models, particularly SD for text-to-image generation, being progressively deployed in the cloud [4], [5], this striking difference in the inference process compared to DNNs motivates us to design a model-less, privacy-preserving, low-latency inference framework for SDaaS.

Developers are incrementally advancing state-of-the-art SD models for text-to-image conversion. Across a variety of pre-trained SD models, each variant exhibits different resource footprints and processing latencies on heterogeneous compute resources. In this paper, we present a model-less, privacy-preserving, low-latency inference framework to satisfy user-defined SLOs for SDaaS. Developers of SD models can register their models on our proposed system through a declarative API. Users, on the other hand, can specify SLOs in terms of the style of the generated image, the requested latency, and the minimum requested CLIP score for inference through the user API. Our proposed system manages model registration from the developers, and schedules volumes of user queries to meet their SLOs through an efficient deployment of the models onto hardware accelerators in the compute cluster.

The rest of the paper is organized as follows. Section II highlights the key findings that guide our inference framework development, followed by our proposed system design in Section III. Section IV delves into our evaluation results, and Section V concludes the paper.

II. MOTIVATION

We evaluate the runtime latencies for a specific prompt across 9 pre-trained SD models (from https://huggingface.co/) for text-to-image generation. We specifically investigate the performance discrepancy between CPU and GPU for a single inference step in the diffusion process. The aggregated findings for a randomly selected prompt, ‘an astronaut riding a cow’, are presented in Table I, utilizing a floating-point precision of 16 consistently across all SD models. Our testbed uses a 64-bit x86_64 Intel(R) Xeon(R) CPU @ 2.20 GHz as the CPU and a Tesla V100 as the GPU. Our analysis reveals that, across all evaluated SD models, the average inference time on the CPU is 92x higher than on a GPU deployment. (1) This highlights the need for scheduling SD inferencing workloads on GPUs to achieve significantly lower latency.

From the data presented in Table I, we also observe variability in inference times when utilizing the same compute resource across a variety of SD models for an identical query. (2) This elucidates that every SD model exhibits significant variations in its resource requirements.

Next, we subject one SD model to the identical inference workload across various hardware accelerators. In Table II, we present the average inference time for 10 inference steps for the SD model ‘dreamlike-art/dreamlike-anime-1.0’ across these accelerators.
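A minimal sketch of such a per-model timing measurement, assuming the Hugging Face diffusers library, is shown below; the model identifier, step count, and precision handling on CPU are illustrative rather than our exact profiling script.

    import time
    import torch
    from diffusers import StableDiffusionPipeline

    PROMPT = "an astronaut riding a cow"

    def time_inference(model_id: str, device: str, steps: int = 1) -> float:
        """Time `steps` diffusion steps of one SD model on the given device."""
        # FP16 on the GPU; FP32 on the CPU for portability (fp16 kernels
        # are not always available on CPU backends).
        dtype = torch.float16 if device == "cuda" else torch.float32
        pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
        pipe = pipe.to(device)
        start = time.perf_counter()
        pipe(PROMPT, num_inference_steps=steps)
        if device == "cuda":
            torch.cuda.synchronize()
        return time.perf_counter() - start

    # Example comparison for one registered model:
    # cpu_t = time_inference("dreamlike-art/dreamlike-anime-1.0", "cpu")
    # gpu_t = time_inference("dreamlike-art/dreamlike-anime-1.0", "cuda")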
TABLE II: Average inference latency for 10 inference steps of the SD model ‘dreamlike-art/dreamlike-anime-1.0’ across hardware accelerators.

GPU           Inference Time (s)
RTX 4090      5.473
RTX 3090      5.952
A100 PCIe     3.684
RTX 4080      7.417
RTX A6000     5.591
RTX A4000     9.601
Tesla V100    6.804
RTX 3060      13.997
A40           6.214
RTX 3080      7.211
Fig. 2: Processing latency for SD models over (a) Tesla V100, (b) A100, and (c) RTX 4080 hardware accelerators; (d) the CLIP score (‘ViT-B/32’) of the SD models as a function of inference steps.
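A minimal sketch of this text-to-image similarity metric, assuming the open-source CLIP package (https://github.com/openai/CLIP), is given below; it illustrates the score as the cosine similarity between prompt and image embeddings rather than our exact evaluation code.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def clip_score(image: Image.Image, prompt: str) -> float:
        """Cosine similarity between CLIP embeddings of an image and its prompt."""
        img = preprocess(image).unsqueeze(0).to(device)
        txt = clip.tokenize([prompt]).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(img)
            txt_feat = model.encode_text(txt)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ txt_feat.T).item()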
TABLE III: Impact of the number of inference steps on inference latency and CLIP score for half-precision and single-precision stable diffusion models.

Inference   Half-precision (16FP)       Single-precision (32FP)
Steps       Latency (s)   CLIP Score    Latency (s)   CLIP Score
1           0.123         0.163         0.283         0.165
2           0.195         0.168         0.386         0.175
3           0.241         0.341         0.493         0.348
4           0.293         0.296         0.604         0.309
5           0.339         0.269         0.704         0.295
6           0.376         0.327         0.805         0.334
7           0.411         0.322         0.892         0.329
8           0.447         0.314         0.985         0.324

Fig. 3: Overview of our proposed system architecture.

III. SYSTEM DESIGN

• Controller The controller serves inference queries from the users. It comprises two modules: (a) the scheduler logic block, and (b) the model registrar. The scheduler logic block consists of the inference time predictor and the scheduling algorithm. Assuming black-box access to the models to preserve developer privacy, all registered models are profiled on the worker nodes (hardware accelerators) over multiple text-to-image prompts for varying inference steps. Based on the profiling results, the inference time predictor heuristically predicts the number of inference steps, for a user-defined SLA_CLIP and SLA_latency, for a specified SD model over a hardware accelerator, to satisfy the SLO for the specific query. The other module, the model registrar, is responsible for managing model registration from the developers to the model repository through an API.

• Worker Worker nodes execute inference queries, following the scheduler logic, as directed by the controller, over the hardware accelerators. Hardware-specific execution daemons manage the deployment and execution of models.

• Model Repository The model repository functions as high-capacity, persistent storage containing the trained SD models registered by the developers. The worker nodes can access this storage to execute SD models on hardware accelerators based on the user inference query.

• Metadata Store Assuming black-box access to the models to preserve developer privacy, all registered models are profiled on the worker nodes over multiple text-to-image prompts for varying inference steps. The metadata store includes information concerning the available models and the profiled mapping from the number of inference steps to mean CLIP score and mean inference latency. The inference time predictor accesses this data to heuristically predict the number of inference steps, for the SLA_CLIP and SLA_latency of a specified model on a hardware accelerator, to satisfy the SLO for the specific query (a minimal sketch of this lookup follows the list).
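A minimal sketch of this lookup is given below, assuming a `profile` mapping built from the metadata store; the helper name `predict_steps` and the data layout are illustrative, not the system's actual API. The heuristic simply selects the smallest profiled step count whose mean latency and mean CLIP score satisfy the query's SLA on the given (model, accelerator) pair.

    from typing import Dict, Optional, Tuple

    # profile[(model_id, accelerator)][steps] = (mean_latency_s, mean_clip_score)
    Profile = Dict[Tuple[str, str], Dict[int, Tuple[float, float]]]

    def predict_steps(profile: Profile, model_id: str, accelerator: str,
                      sla_latency: float, sla_clip: float) -> Optional[int]:
        """Smallest profiled step count meeting both SLOs, or None if infeasible."""
        table = profile.get((model_id, accelerator), {})
        feasible = [steps for steps, (latency, clip_s) in table.items()
                    if latency <= sla_latency and clip_s >= sla_clip]
        return min(feasible) if feasible else None

    # Example user query (fields mirror the user API: prompt, style, SLA_CLIP, SLA_latency):
    query = {"prompt": "an astronaut riding a cow", "style": "anime",
             "sla_clip": 0.30, "sla_latency": 10.0}
    # steps = predict_steps(profile, "dreamlike-art/dreamlike-anime-1.0",
    #                       "Tesla V100", query["sla_latency"], query["sla_clip"])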
IV. EVALUATION

We evaluated our inferencing framework over 3 SD models, namely ‘DGSpitzer/Cyberpunk-Anime-Diffusion’, ‘runwayml/stable-diffusion-v1-5’, and ‘hakurei/waifu-diffusion’, corresponding to the styles (s) anime, regular diffusion, and waifu, respectively.
To simulate the text prompts tp in user queries, we randomly sampled 100 prompts from Flickr8k (35 tp with s: regular diffusion, 33 tp with s: anime, and 32 tp with s: waifu). The requested CLIP score (SLA_CLIP) is sampled from a normal distribution with µ: 0.3, σ: 0.02, while the requested latency (SLA_latency) is sampled from a normal distribution with µ: 10, σ: 2, in the user query. Due to a lack of publicly available traces for generative AI workloads, we opted for an increasing workload pattern [7] for the job arrival rate. We set a job arrival rate of 0.8 jobs/second, increasing linearly to 0.85 jobs/second over 100 seconds. All experiments were performed over NVIDIA T4 and NVIDIA A10G GPUs, at a fixed floating-point precision of 16 for the SD models. The CLIP score was estimated with a pre-trained ‘ViT-B/32’ model.
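A minimal sketch of this query generation is given below, assuming exponentially distributed inter-arrival times at the instantaneous rate (the arrival process beyond the linearly increasing rate is not specified here) and a uniform style assignment rather than the fixed 35/33/32 split; names such as `make_jobs` are illustrative.

    import random

    def make_jobs(prompts, styles, duration_s=100.0, rate_start=0.8, rate_end=0.85):
        """Sample user queries with SLA_CLIP ~ N(0.3, 0.02), SLA_latency ~ N(10, 2),
        and an arrival rate rising linearly from rate_start to rate_end jobs/s."""
        jobs, t = [], 0.0
        while True:
            rate = rate_start + (rate_end - rate_start) * (t / duration_s)
            t += random.expovariate(rate)  # assumed Poisson arrivals at the current rate
            if t >= duration_s:
                break
            jobs.append({
                "arrival_s": t,
                "prompt": random.choice(prompts),
                "style": random.choice(styles),  # simplification of the per-style prompt split
                "sla_clip": random.gauss(0.3, 0.02),
                "sla_latency": random.gauss(10.0, 2.0),
            })
        return jobs

    # jobs = make_jobs(flickr8k_prompts, ["regular diffusion", "anime", "waifu"])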
We compare First-In First-Out (FIFO), Multi-Level Feedback Queue (MLFQ), Linear Programming (CP-SAT) used to schedule traditional DNNs (in our case, we utilize the inference predictor to predict the number of inference steps for processing latency), and our proposed Shortest Job First (SJF) algorithm in combination with the inference time predictor.
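A minimal sketch of this policy is given below, reusing the illustrative `predict_steps` helper and `profile` layout from Section III and assuming each queued job already carries the SD model resolved from its requested style.

    def schedule_sjf(queue, profile, accelerator):
        """Order pending jobs by predicted completion time on this accelerator;
        jobs with no feasible step count are skipped (e.g., deferred elsewhere)."""
        ranked = []
        for job in queue:
            steps = predict_steps(profile, job["model_id"], accelerator,
                                  job["sla_latency"], job["sla_clip"])
            if steps is None:
                continue
            predicted_latency, _ = profile[(job["model_id"], accelerator)][steps]
            ranked.append((predicted_latency, steps, job))
        ranked.sort(key=lambda item: item[0])  # shortest predicted job first
        return [(job, steps) for _, steps, job in ranked]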
We first evaluate a scenario wherein the requested job queries are characterized by tp, SLO_CLIP, and s. We scale the job arrival rate to compute the average job completion time. Our results, as presented in Fig. 4, indicate the highest system throughput for our proposed inference time predictor module in combination with SJF. Next, we evaluate a scenario wherein the job queries are characterized by tp, SLO_CLIP, SLO_latency, and s. On scaling the job arrival rate and SLO_latency, our results indicate that our proposed inference time predictor in combination with SJF outperforms the baselines, as presented in Fig. 5 and Fig. 6, respectively.

Fig. 4: Variation in job arrival rate to average job completion time. Our proposed inference time predictor in combination with SJF has the highest throughput.

Fig. 5: Variation in job arrival rate to the number of SLOs met (CPSAT, SJF, FIFO, MLFQ).

Fig. 6: Variation in SLO_latency to the number of SLOs met. Our proposed inference time predictor in combination with SJF satisfies the most SLOs.

V. CONCLUSION
Our proposed model-less, privacy-preserving, low-latency inferencing framework for SDaaS outperforms the baseline inference scheduling approaches. It would be essential to conduct evaluations on an expanded cloud trace, incorporating additional model variants and alternative job arrival patterns [7]. The inference time predictor can be dynamically updated to adapt to SLO needs.

REFERENCES