
vLLM

the vLLM Team

Mar 01, 2024


GETTING STARTED

1 Documentation

2 Indices and tables

Python Module Index

Index
vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
• State-of-the-art serving throughput
• Efficient management of attention key and value memory with PagedAttention
• Continuous batching of incoming requests
• Fast model execution with CUDA/HIP graph
• Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
• Optimized CUDA kernels
vLLM is flexible and easy to use with:
• Seamless integration with popular HuggingFace models
• High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
• Tensor parallelism support for distributed inference
• Streaming outputs
• OpenAI-compatible API server
• Support for NVIDIA GPUs and AMD GPUs
• (Experimental) Prefix caching support
• (Experimental) Multi-LoRA support
For more information, check out the following:
• vLLM announcing blog post (intro to PagedAttention)
• vLLM paper (SOSP 2023)
• How continuous batching enables 23x throughput in LLM inference while reducing p50 latency, by Cade Daniel et al.

CHAPTER ONE

DOCUMENTATION

1.1 Installation

vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.

1.1.1 Requirements

• OS: Linux
• Python: 3.8 – 3.11
• GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)

1.1.2 Install with pip

You can install vLLM using pip:

$ # (Optional) Create a new conda environment.
$ conda create -n myenv python=3.9 -y
$ conda activate myenv

$ # Install vLLM with CUDA 12.1.
$ pip install vllm

Note: As of now, vLLM’s binaries are compiled on CUDA 12.1 by default. However, you can install vLLM with
CUDA 11.8 by running:

$ # Install vLLM with CUDA 11.8.
$ export VLLM_VERSION=0.2.4
$ export PYTHON_VERSION=39
$ pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl

$ # Re-install PyTorch with CUDA 11.8.
$ pip uninstall torch -y
$ pip install torch --upgrade --index-url https://download.pytorch.org/whl/cu118

$ # Re-install xFormers with CUDA 11.8.
$ pip uninstall xformers -y
$ pip install --upgrade xformers --index-url https://download.pytorch.org/whl/cu118

1.1.3 Build from source

You can also build and install vLLM from source:

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -e .  # This may take 5-10 minutes.

Tip: If you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.

$ # Use `--ipc=host` to make sure the shared memory is large enough.
$ docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3

Note: If you are developing the C++ backend of vLLM, consider building vLLM with

$ python setup.py develop

since it will give you incremental builds. The downside is that this method is deprecated by setuptools.

1.2 Installation with ROCm

vLLM 0.2.4 and later supports model inference and serving on AMD GPUs with ROCm. At the moment, AWQ quantization is not supported on ROCm, but SqueezeLLM quantization has been ported. The data types currently supported on ROCm are FP16 and BF16.

1.2.1 Requirements

• OS: Linux
• Python: 3.8 – 3.11
• GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100)
• Pytorch 2.0.1/2.1.1/2.2
• ROCm 5.7 (Verified on python 3.10) or ROCm 6.0 (Verified on python 3.9)
Installation options:
1. (Recommended) Quick start with vLLM pre-installed in Docker Image
2. Build from source
3. Build from source with docker


1.2.2 (Recommended) Option 1: Quick start with vLLM pre-installed in Docker Image

This option is for ROCm 5.7 only:

$ docker pull embeddedllminfo/vllm-rocm:vllm-v0.2.4
$ docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v <path/to/model>:/app/model \
    embeddedllminfo/vllm-rocm \
    bash

1.2.3 Option 2: Build from source

You can build and install vLLM from source.

The instructions below are for ROCm 5.7 only. At the time of this documentation update, a PyTorch wheel for ROCm 6.0 is not yet available on the PyTorch website.
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
• ROCm
• Pytorch

$ pip install torch==2.2.0.dev20231206+rocm5.7 --index-url https://download.pytorch.org/whl/nightly/rocm5.7  # tested version

1. Install flash attention for ROCm
   Install ROCm's flash attention (v2.0.4) following the instructions from ROCmSoftwarePlatform/flash-attention.

Note:
• If you are using rocm5.7 with pytorch 2.1.0 onwards, you don’t need to apply the hipify_python.patch. You can
build the ROCm flash attention directly.
• If you fail to install ROCmSoftwarePlatform/flash-attention, try cloning from the commit
6fd2f8e572805681cd67ef8596c7e2ce521ed3c6.
• ROCm’s Flash-attention-2 (v2.0.4) does not support sliding windows attention.
• You might need to downgrade the "ninja" version to 1.10, as it is not used when compiling flash-attention-2 (e.g. pip install ninja==1.10.2.4).

2. Setup xformers==0.0.23 without dependencies, and apply patches to adapt for ROCm flash attention

$ pip install xformers==0.0.23 --no-deps
$ bash patch_xformers.rocm.sh


3. Build vLLM.

$ cd vllm
$ pip install -U -r requirements-rocm.txt
$ python setup.py install  # This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.

1.2.4 Option 3: Build from source with docker

You can build and install vLLM from source.

Build a docker image from Dockerfile.rocm, and launch a docker container.
The Dockerfile.rocm is designed to support both ROCm 5.7 and ROCm 6.0 and later versions. It provides flexibility to customize the build of the docker image using the following arguments:
• BASE_IMAGE: specifies the base image used when running docker build, specifically the PyTorch on ROCm base image. We have tested ROCm 5.7 and ROCm 6.0. The default is rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1.
• FX_GFX_ARCHS: specifies the GFX architecture that is used to build flash-attention, for example, gfx90a;gfx942 for MI200 and MI300. The default is gfx90a;gfx942.
• FA_BRANCH: specifies the branch used to build flash-attention in ROCmSoftwarePlatform's flash-attention repo. The default is 3d2b6f5.
• BUILD_FA: specifies whether to build flash-attention. For the Radeon RX 7900 series (gfx1100), this should be set to 0 until flash-attention supports this target.
Their values can be passed in when running docker build with --build-arg options.
For example, to build a docker image for vLLM on ROCm 5.7, you can run:

$ docker build --build-arg BASE_IMAGE="rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1" \
    -f Dockerfile.rocm -t vllm-rocm .

To build vllm on ROCm 6.0, you can use the default:

$ docker build -f Dockerfile.rocm -t vllm-rocm .
$ docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v <path/to/model>:/app/model \
    vllm-rocm \
    bash

Alternatively, if you plan to install vLLM-ROCm on a local machine or start from a fresh docker image (e.g.
rocm/pytorch), you can follow the steps below:
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
• ROCm


• Pytorch
• hipBLAS
1. Install flash attention for ROCm
   Install ROCm's flash attention (v2.0.4) following the instructions from ROCmSoftwarePlatform/flash-attention.

Note:
• If you are using rocm5.7 with pytorch 2.1.0 onwards, you don’t need to apply the hipify_python.patch. You can
build the ROCm flash attention directly.
• If you fail to install ROCmSoftwarePlatform/flash-attention, try cloning from the commit
6fd2f8e572805681cd67ef8596c7e2ce521ed3c6.
• ROCm’s Flash-attention-2 (v2.0.4) does not support sliding windows attention.
• You might need to downgrade the "ninja" version to 1.10, as it is not used when compiling flash-attention-2 (e.g. pip install ninja==1.10.2.4).

2. Setup xformers==0.0.23 without dependencies, and apply patches to adapt for ROCm flash attention

$ pip install xformers==0.0.23 --no-deps
$ bash patch_xformers.rocm.sh

3. Build vLLM.

$ cd vllm
$ pip install -U -r requirements-rocm.txt
$ python setup.py install # This may take 5-10 minutes.

Note:
• You may need to turn on the --enforce-eager flag if you experience a process hang when running the benchmark_throughput.py script to test your installation.

1.3 Quickstart

This guide shows how to use vLLM to:


• run offline batched inference on a dataset;
• build an API server for a large language model;
• start an OpenAI-compatible API server.
Be sure to complete the installation instructions before continuing with this guide.

Note: By default, vLLM downloads models from HuggingFace. If you would like to use models from ModelScope in
the following examples, please set the environment variable:

export VLLM_USE_MODELSCOPE=True


1.3.1 Offline Batched Inference

We first show an example of using vLLM for offline batched inference on a dataset. In other words, we use vLLM to
generate texts for a list of input prompts.
Import LLM and SamplingParams from vLLM. The LLM class is the main class for running offline inference with
vLLM engine. The SamplingParams class specifies the parameters for the sampling process.

from vllm import LLM, SamplingParams

Define the list of input prompts and the sampling parameters for generation. The sampling temperature is set to 0.8 and
the nucleus sampling probability is set to 0.95. For more information about the sampling parameters, refer to the class
definition.

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

Initialize vLLM’s engine for offline inference with the LLM class and the OPT-125M model. The list of supported
models can be found at supported models.

llm = LLM(model="facebook/opt-125m")

Call llm.generate to generate the outputs. It adds the input prompts to vLLM engine’s waiting queue and executes
the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of RequestOutput
objects, which include all the output tokens.

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The code example can also be found in examples/offline_inference.py.

1.3.2 OpenAI-Compatible Server

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API. By default, it starts the server at http://localhost:8000. You can specify the address with the --host and --port arguments. The server currently hosts one model at a time (OPT-125M in the command below) and implements the list models, create chat completion, and create completion endpoints. We are actively adding support for more endpoints.
Start the server:

$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m


By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using
the --chat-template argument:

$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m \
$ --chat-template ./examples/template_chatml.jinja

This server can be queried in the same format as OpenAI API. For example, list the models:

$ curl http://localhost:8000/v1/models

You can pass in the argument --api-key or the environment variable VLLM_API_KEY to enable the server to check for the API key in the request header.
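For example, if the server was started with --api-key token-abc123 (an illustrative value, not a real key), clients must supply the same key. A minimal sketch using the openai package:

from openai import OpenAI

# The key must match the value passed via --api-key or VLLM_API_KEY
# when the server was started; "token-abc123" is only an example.
client = OpenAI(
    api_key="token-abc123",
    base_url="http://localhost:8000/v1",
)
models = client.models.list()
print([model.id for model in models.data])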

Using OpenAI Completions API with vLLM

Query the model with input prompts:

$ curl http://localhost:8000/v1/completions \
$ -H "Content-Type: application/json" \
$ -d '{
$ "model": "facebook/opt-125m",
$ "prompt": "San Francisco is a",
$ "max_tokens": 7,
$ "temperature": 0
$ }'

Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using
OpenAI API. For example, another way to query the server is via the openai python package:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="facebook/opt-125m",
                                        prompt="San Francisco is a")
print("Completion result:", completion)

For a more detailed client example, refer to examples/openai_completion_client.py.


Using OpenAI Chat API with vLLM

The vLLM server is designed to support the OpenAI Chat API, allowing you to engage in dynamic conversations
with the model. The chat interface is a more interactive way to communicate with the model, allowing back-and-
forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed
explanations.
Querying the model using OpenAI Chat API:
You can use the create chat completion endpoint to communicate with the model in a chat-like interface:

$ curl http://localhost:8000/v1/chat/completions \
$ -H "Content-Type: application/json" \
$ -d '{
$ "model": "facebook/opt-125m",
$ "messages": [
$ {"role": "system", "content": "You are a helpful assistant."},
$ {"role": "user", "content": "Who won the world series in 2020?"}
$ ]
$ }'

Python Client Example:


Using the openai python package, you can also communicate with the model in a chat-like manner:

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)

For more in-depth examples and advanced features of the chat API, you can refer to the official OpenAI documentation.


1.4 Distributed Inference and Serving

vLLM supports distributed tensor-parallel inference and serving. Currently, we support Megatron-LM’s tensor parallel
algorithm. We manage the distributed runtime with Ray. To run distributed inference, install Ray with:

$ pip install ray

To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs
you want to use. For example, to run inference on 4 GPUs:

from vllm import LLM

llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")

To run multi-GPU serving, pass in the --tensor-parallel-size argument when starting the server. For example,
to run API server on 4 GPUs:

$ python -m vllm.entrypoints.api_server \
$ --model facebook/opt-13b \
$ --tensor-parallel-size 4

To scale vLLM beyond a single machine, start a Ray runtime via CLI before running vLLM:

$ # On head node
$ ray start --head

$ # On worker nodes
$ ray start --address=<ray-head-address>

After that, you can run inference and serving across multiple machines by launching the vLLM process on the head node and setting tensor_parallel_size to the total number of GPUs across all machines.
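For example, assuming two machines with four GPUs each and the Ray cluster started as above, a minimal sketch of multi-node inference looks like this (the model choice is illustrative):

from vllm import LLM

# The Ray runtime started via `ray start` must already span both machines,
# so that 8 GPUs are visible in total.
llm = LLM("facebook/opt-13b", tensor_parallel_size=8)
output = llm.generate("San Francisco is a")
print(output)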

1.5 Running on clouds with SkyPilot

vLLM can be run on the cloud to scale to multiple GPUs with SkyPilot, an open-source framework for running LLMs
on any cloud.
To install SkyPilot and setup your cloud credentials, run:

$ pip install skypilot
$ sky check

See the vLLM SkyPilot YAML for serving, serving.yaml.

resources:
  accelerators: A100

envs:
  MODEL_NAME: decapoda-research/llama-13b-hf
  TOKENIZER: hf-internal-testing/llama-tokenizer

setup: |
  conda create -n vllm python=3.9 -y
  conda activate vllm
  git clone https://github.com/vllm-project/vllm.git
  cd vllm
  pip install .
  pip install gradio

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --tokenizer $TOKENIZER 2>&1 | tee api_server.log &
  echo 'Waiting for vllm api server to start...'
  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
  echo 'Starting gradio server...'
  python vllm/examples/gradio_webserver.py

Start serving the LLaMA-13B model on an A100 GPU:

$ sky launch serving.yaml

Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in
your browser to use the LLaMA model to do the text completion.

(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live

Optional: Serve the 65B model instead of the default 13B and use more GPUs:

sky launch -c vllm-serve-new -s serve.yaml --gpus A100:8 --env MODEL_NAME=decapoda-research/llama-65b-hf

1.6 Deploying with KServe

vLLM can be deployed with KServe on Kubernetes for highly scalable distributed model serving.
Please see this guide for more details on using vLLM with KServe.

1.7 Deploying with NVIDIA Triton

The Triton Inference Server hosts a tutorial demonstrating how to quickly deploy a simple facebook/opt-125m model
using vLLM. Please see Deploying a vLLM model in Triton for more details.


1.8 Deploying with Docker

vLLM offers an official Docker image for deployment. The image can be used to run an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.

$ docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-v0.1

Note: You can use either the --ipc=host flag or the --shm-size flag to allow the container to access the host's shared memory. vLLM uses PyTorch, which uses shared memory to share data between processes under the hood, particularly for tensor parallel inference.

You can build and run vLLM from source via the provided Dockerfile. To build vLLM:

$ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai  # optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2

Note: By default vLLM builds for all GPU types for the widest distribution. If you are only building for the GPU type the machine is running on, you can add the argument --build-arg torch_cuda_arch_list="" so that vLLM detects the current GPU type and builds for that.

To run vLLM:

$ docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    vllm/vllm-openai <args...>

1.9 Serving with Langchain

vLLM is also available via Langchain.

To install langchain, run

$ pip install langchain langchain_community -q

To run inference on a single or multiple GPUs, use the VLLM class from langchain.

from langchain_community.llms import VLLM

llm = VLLM(model="mosaicml/mpt-7b",
           trust_remote_code=True,  # mandatory for hf models
           max_new_tokens=128,
           top_k=10,
           top_p=0.95,
           temperature=0.8,
           # tensor_parallel_size=... # for distributed inference
           )

print(llm("What is the capital of France ?"))

Please refer to this Tutorial for more details.

1.10 Production Metrics

vLLM exposes a number of metrics that can be used to monitor the health of the system. These metrics are exposed
via the /metrics endpoint on the vLLM OpenAI compatible API server.
The following metrics are exposed:

class Metrics:

    def __init__(self, labelnames: List[str]):
        # Unregister any existing vLLM collectors
        for collector in list(REGISTRY._collector_to_names):
            if hasattr(collector, "_name") and "vllm" in collector._name:
                REGISTRY.unregister(collector)

        self.info_cache_config = Info(
            name='vllm:cache_config',
            documentation='information of cache_config')

        # System stats
        self.gauge_scheduler_running = Gauge(
            name="vllm:num_requests_running",
            documentation="Number of requests currently running on GPU.",
            labelnames=labelnames)
        self.gauge_scheduler_swapped = Gauge(
            name="vllm:num_requests_swapped",
            documentation="Number of requests swapped to CPU.",
            labelnames=labelnames)
        self.gauge_scheduler_waiting = Gauge(
            name="vllm:num_requests_waiting",
            documentation="Number of requests waiting to be processed.",
            labelnames=labelnames)
        self.gauge_gpu_cache_usage = Gauge(
            name="vllm:gpu_cache_usage_perc",
            documentation="GPU KV-cache usage. 1 means 100 percent usage.",
            labelnames=labelnames)
        self.gauge_cpu_cache_usage = Gauge(
            name="vllm:cpu_cache_usage_perc",
            documentation="CPU KV-cache usage. 1 means 100 percent usage.",
            labelnames=labelnames)

        # Raw stats from last model iteration
        self.counter_prompt_tokens = Counter(
            name="vllm:prompt_tokens_total",
            documentation="Number of prefill tokens processed.",
            labelnames=labelnames)
        self.counter_generation_tokens = Counter(
            name="vllm:generation_tokens_total",
            documentation="Number of generation tokens processed.",
            labelnames=labelnames)
        self.histogram_time_to_first_token = Histogram(
            name="vllm:time_to_first_token_seconds",
            documentation="Histogram of time to first token in seconds.",
            labelnames=labelnames,
            buckets=[
                0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5,
                0.75, 1.0, 2.5, 5.0, 7.5, 10.0
            ])
        self.histogram_time_per_output_token = Histogram(
            name="vllm:time_per_output_token_seconds",
            documentation="Histogram of time per output token in seconds.",
            labelnames=labelnames,
            buckets=[
                0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75,
                1.0, 2.5
            ])
        self.histogram_e2e_request_latency = Histogram(
            name="vllm:e2e_request_latency_seconds",
            documentation="Histogram of end to end request latency in seconds.",
            labelnames=labelnames,
            buckets=[1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0])

        # Legacy metrics
        self.gauge_avg_prompt_throughput = Gauge(
            name="vllm:avg_prompt_throughput_toks_per_s",
            documentation="Average prefill throughput in tokens/s.",
            labelnames=labelnames,
        )
        self.gauge_avg_generation_throughput = Gauge(
            name="vllm:avg_generation_throughput_toks_per_s",
            documentation="Average generation throughput in tokens/s.",
            labelnames=labelnames,
        )
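As a quick sanity check that metrics are being exported, you can scrape the /metrics endpoint directly once the OpenAI-compatible server is running. A minimal sketch, assuming the server listens on localhost:8000:

# Fetch the Prometheus-format metrics and print only the vLLM series.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as response:
    text = response.read().decode("utf-8")

for line in text.splitlines():
    if line.startswith("vllm:"):
        print(line)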


1.11 Supported Models

vLLM supports a variety of generative Transformer models in HuggingFace Transformers. The following is the list
of model architectures that are currently supported by vLLM. Alongside each architecture, we include some popular
models that use it.

Architecture (supported models): example HuggingFace models

• AquilaForCausalLM (Aquila): BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.
• BaiChuanForCausalLM (Baichuan): baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.
• ChatGLMModel (ChatGLM): THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.
• DeciLMForCausalLM (DeciLM): Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.
• BloomForCausalLM (BLOOM, BLOOMZ, BLOOMChat): bigscience/bloom, bigscience/bloomz, etc.
• FalconForCausalLM (Falcon): tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.
• GemmaForCausalLM (Gemma): google/gemma-2b, google/gemma-7b, etc.
• GPT2LMHeadModel (GPT-2): gpt2, gpt2-xl, etc.
• GPTBigCodeForCausalLM (StarCoder, SantaCoder, WizardCoder): bigcode/starcoder, bigcode/gpt_bigcode-santacoder, WizardLM/WizardCoder-15B-V1.0, etc.
• GPTJForCausalLM (GPT-J): EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.
• GPTNeoXForCausalLM (GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM): EleutherAI/gpt-neox-20b, EleutherAI/pythia-12b, OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.
• InternLMForCausalLM (InternLM): internlm/internlm-7b, internlm/internlm-chat-7b, etc.
• InternLM2ForCausalLM (InternLM2): internlm/internlm2-7b, internlm/internlm2-chat-7b, etc.
• LlamaForCausalLM (LLaMA, LLaMA-2, Vicuna, Alpaca, Yi): meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf, openlm-research/open_llama_13b, lmsys/vicuna-13b-v1.3, 01-ai/Yi-6B, 01-ai/Yi-34B, etc.
• MistralForCausalLM (Mistral, Mistral-Instruct): mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.
• MixtralForCausalLM (Mixtral-8x7B, Mixtral-8x7B-Instruct): mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, etc.
• MPTForCausalLM (MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter): mosaicml/mpt-7b, mosaicml/mpt-7b-storywriter, mosaicml/mpt-30b, etc.
• OLMoForCausalLM (OLMo): allenai/OLMo-1B, allenai/OLMo-7B, etc.
• OPTForCausalLM (OPT, OPT-IML): facebook/opt-66b, facebook/opt-iml-max-30b, etc.
• OrionForCausalLM (Orion): OrionStarAI/Orion-14B-Base, OrionStarAI/Orion-14B-Chat, etc.
• PhiForCausalLM (Phi): microsoft/phi-1_5, microsoft/phi-2, etc.
• QWenLMHeadModel (Qwen): Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.
• Qwen2ForCausalLM (Qwen2): Qwen/Qwen2-beta-7B, Qwen/Qwen2-beta-7B-Chat, etc.
• StableLmForCausalLM (StableLM): stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc.


If your model uses one of the above model architectures, you can seamlessly run your model with vLLM. Otherwise,
please refer to Adding a New Model for instructions on how to implement support for your model. Alternatively, you
can raise an issue on our GitHub project.

Note: Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.

Tip: The easiest way to check if your model is supported is to run the program below:

from vllm import LLM

llm = LLM(model=...)  # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)

If vLLM successfully generates text, it indicates that your model is supported.

Tip: To use models from ModelScope instead of HuggingFace Hub, set an environment variable:

$ export VLLM_USE_MODELSCOPE=True

And use with trust_remote_code=True.

from vllm import LLM

llm = LLM(model=..., revision=..., trust_remote_code=True)  # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)

1.12 Adding a New Model

This document provides a high-level guide on integrating a HuggingFace Transformers model into vLLM.

Note: The complexity of adding a new model depends heavily on the model's architecture. The process is fairly straightforward if the model shares a similar architecture with an existing model in vLLM. However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.

Tip: If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our GitHub
repository. We will be happy to help you out!


1.12.1 0. Fork the vLLM repository

Start by forking our GitHub repository and then build it from source. This gives you the ability to modify the codebase
and test your model.

1.12.2 1. Bring your model code

Clone the PyTorch model code from the HuggingFace Transformers repository and put it into the vllm/model_executor/models directory. For instance, vLLM's OPT model was adapted from HuggingFace's modeling_opt.py file.

Warning: When copying the model code, make sure to review and adhere to the code’s copyright and licensing
terms.

1.12.3 2. Rewrite the forward methods

Next, you need to rewrite the forward methods of your model by following these steps:
1. Remove any unnecessary code, such as the code only used for training.
2. Change the input parameters:

def forward(
    self,
    input_ids: torch.Tensor,
-   attention_mask: Optional[torch.Tensor] = None,
-   position_ids: Optional[torch.LongTensor] = None,
-   past_key_values: Optional[List[torch.FloatTensor]] = None,
-   inputs_embeds: Optional[torch.FloatTensor] = None,
-   labels: Optional[torch.LongTensor] = None,
-   use_cache: Optional[bool] = None,
-   output_attentions: Optional[bool] = None,
-   output_hidden_states: Optional[bool] = None,
-   return_dict: Optional[bool] = None,
-) -> Union[Tuple, CausalLMOutputWithPast]:
+   positions: torch.Tensor,
+   kv_caches: List[KVCache],
+   input_metadata: InputMetadata,
+) -> Optional[SamplerOutput]:

3. Update the code by considering that input_ids and positions are now flattened tensors.
4. Replace the attention operation with either PagedAttention, PagedAttentionWithRoPE, or PagedAttentionWithALiBi depending on the model's architecture.

Note: Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional
embeddings. If your model employs a different attention mechanism, you will need to implement a new attention layer
in vLLM.


1.12.4 3. (Optional) Implement tensor parallelism and quantization support

If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it. To do this, substitute your model's linear and embedding layers with their tensor-parallel versions. For the embedding layer, you can simply replace nn.Embedding with VocabParallelEmbedding. For the output LM head, you can use ParallelLMHead. When it comes to the linear layers, we provide the following options to parallelize them:
• ReplicatedLinear: Replicates the inputs and weights across multiple GPUs. No memory saving.
• RowParallelLinear: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An all-reduce operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
• ColumnParallelLinear: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
• MergedColumnParallelLinear: Column-parallel linear that merges multiple ColumnParallelLinear operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
• QKVParallelLinear: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When the number of key/value heads is less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
Note that all the linear layers above take linear_method as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
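As a rough sketch of how these layers are used, the following replaces a HuggingFace-style MLP with its tensor-parallel equivalent. The import path and constructor arguments follow the pattern of existing vLLM models but are assumptions that may differ between vLLM versions:

import torch
from torch import nn
from vllm.model_executor.layers.linear import (ColumnParallelLinear,
                                               RowParallelLinear)


class ParallelMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, linear_method=None):
        super().__init__()
        # First FFN layer: weight is partitioned along the output (column) dimension.
        self.up_proj = ColumnParallelLinear(hidden_size,
                                            intermediate_size,
                                            bias=False,
                                            linear_method=linear_method)
        # Second FFN layer: weight is partitioned along the input (row) dimension;
        # an all-reduce combines the partial results across GPUs.
        self.down_proj = RowParallelLinear(intermediate_size,
                                           hidden_size,
                                           bias=False,
                                           linear_method=linear_method)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # vLLM's parallel linear layers return (output, optional_bias).
        x, _ = self.up_proj(x)
        x = self.act(x)
        x, _ = self.down_proj(x)
        return x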

1.12.5 4. Implement the weight loading logic

You now need to implement the load_weights method in your *ForCausalLM class. This method should load the weights from HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically, for MergedColumnParallelLinear and QKVParallelLinear layers, if the original model has separated weight matrices, you need to load the different parts separately.
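A sketch of what this method typically looks like is shown below. The helpers hf_model_weights_iterator and default_weight_loader, and the weight_loader attribute on fused parameters, follow the pattern used by existing vLLM models; exact names and signatures may differ between versions:

from vllm.model_executor.weight_utils import (default_weight_loader,
                                              hf_model_weights_iterator)


# Inside your *ForCausalLM class:
def load_weights(self, model_name_or_path, cache_dir=None,
                 load_format="auto", revision=None):
    # (vllm_param_name, hf_weight_name, shard_id): maps separate HuggingFace
    # weights onto the fused QKVParallelLinear parameter.
    stacked_params_mapping = [
        ("qkv_proj", "q_proj", "q"),
        ("qkv_proj", "k_proj", "k"),
        ("qkv_proj", "v_proj", "v"),
    ]
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in hf_model_weights_iterator(
            model_name_or_path, cache_dir, load_format, revision):
        for param_name, weight_name, shard_id in stacked_params_mapping:
            if weight_name not in name:
                continue
            # Fused layers attach a custom weight_loader that places each shard.
            param = params_dict[name.replace(weight_name, param_name)]
            param.weight_loader(param, loaded_weight, shard_id)
            break
        else:
            param = params_dict[name]
            weight_loader = getattr(param, "weight_loader", default_weight_loader)
            weight_loader(param, loaded_weight)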

1.12.6 5. Register your model

Finally, include your *ForCausalLM class in vllm/model_executor/models/__init__.py and register it to the _MODEL_REGISTRY in vllm/model_executor/model_loader.py.
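A sketch of the registration (the registry's exact structure and location vary by vLLM version, and MyModelForCausalLM is a hypothetical class name):

# In vllm/model_executor/model_loader.py (illustrative):
_MODEL_REGISTRY = {
    # ... existing architectures ...
    "MyModelForCausalLM": MyModelForCausalLM,  # hypothetical new architecture
}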

1.13 Engine Arguments

Below, you can find an explanation of every engine argument for vLLM; a sketch of the equivalent Python usage follows the list:
--model <model_name_or_path>
Name or path of the huggingface model to use.
--tokenizer <tokenizer_name_or_path>
Name or path of the huggingface tokenizer to use.


--revision <revision>
The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use
the default version.
--tokenizer-revision <revision>
The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will
use the default version.
--tokenizer-mode {auto,slow}
The tokenizer mode.
• “auto” will use the fast tokenizer if available.
• “slow” will always use the slow tokenizer.
--trust-remote-code
Trust remote code from huggingface.
--download-dir <directory>
Directory to download and load the weights, default to the default cache dir of huggingface.
--load-format {auto,pt,safetensors,npcache,dummy}
The format of the model weights to load.
• “auto” will try to load the weights in the safetensors format and fall back to the pytorch bin format if
safetensors format is not available.
• “pt” will load the weights in the pytorch bin format.
• “safetensors” will load the weights in the safetensors format.
• “npcache” will load the weights in pytorch format and store a numpy cache to speed up the loading.
• “dummy” will initialize the weights with random values, mainly for profiling.
--dtype {auto,half,float16,bfloat16,float,float32}
Data type for model weights and activations.
• “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
• “half” for FP16. Recommended for AWQ quantization.
• “float16” is the same as “half”.
• “bfloat16” for a balance between precision and range.
• “float” is shorthand for FP32 precision.
• “float32” for FP32 precision.
--max-model-len <length>
Model context length. If unspecified, will be automatically derived from the model config.
--worker-use-ray
Use Ray for distributed serving, will be automatically set when using more than 1 GPU.
--pipeline-parallel-size (-pp) <size>
Number of pipeline stages.
--tensor-parallel-size (-tp) <size>
Number of tensor parallel replicas.


--max-parallel-loading-workers <workers>
Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models.
--block-size {8,16,32}
Token block size for contiguous chunks of tokens.
--seed <seed>
Random seed for operations.
--swap-space <size>
CPU swap space size (GiB) per GPU.
--gpu-memory-utilization <fraction>
The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a
value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
--max-num-batched-tokens <tokens>
Maximum number of batched tokens per iteration.
--max-num-seqs <sequences>
Maximum number of sequences per iteration.
--max-paddings <paddings>
Maximum number of paddings in a batch.
--disable-log-stats
Disable logging statistics.
--quantization (-q) {awq,squeezellm,None}
Method used to quantize the weights.
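Many of these flags have keyword equivalents on the LLM class, which forwards them to the engine. The mapping below is an assumption based on matching argument names and may differ between versions:

from vllm import LLM

# Roughly equivalent to:
#   --model facebook/opt-125m --dtype half --gpu-memory-utilization 0.8
#   --max-model-len 2048 --seed 0 --tensor-parallel-size 1
llm = LLM(
    model="facebook/opt-125m",
    dtype="half",
    gpu_memory_utilization=0.8,
    max_model_len=2048,
    seed=0,
    tensor_parallel_size=1,
)
print(llm.generate("Hello, my name is"))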

1.14 Using LoRA adapters

This document shows you how to use LoRA adapters with vLLM on top of a base model. Adapters can be efficiently served on a per-request basis with minimal overhead. First we download the adapter(s) and save them locally with

from huggingface_hub import snapshot_download

sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")

Then we instantiate the base model and pass in the enable_lora=True flag:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

We can now submit the prompts and call llm.generate with the lora_request parameter. The first parameter of LoRARequest is a human-identifiable name, the second parameter is a globally unique ID for the adapter, and the third parameter is the path to the LoRA adapter.

sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)

prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)

Check out examples/multilora_inference.py for an example of how to use LoRA adapters with the async engine and
how to use more advanced configuration options.

1.14.1 Serving LoRA Adapters

LoRA-adapted models can also be served with the OpenAI-compatible vLLM server. To do so, we use --lora-modules {name}={path} {name}={path} to specify each LoRA module when we kick off the server:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/

The server entrypoint accepts all other LoRA configuration parameters (max_loras, max_lora_rank,
max_cpu_loras, etc.), which will apply to all forthcoming requests. Upon querying the /models endpoint,
we should see our LoRA along with its base model:

curl localhost:8000/v1/models | jq .
{
    "object": "list",
    "data": [
        {
            "id": "meta-llama/Llama-2-7b-hf",
            "object": "model",
            ...
        },
        {
            "id": "sql-lora",
            "object": "model",
            ...
        }
    ]
}

Requests can specify the LoRA adapter as if it were any other model via the model request parameter. The requests
will be processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and
potentially other LoRA adapter requests if they were provided and max_loras is set high enough).
The following is an example request
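A minimal sketch using the openai package, assuming the server above is running on localhost:8000 and the adapter was registered under the name sql-lora:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Target the LoRA adapter by the name given to --lora-modules.
completion = client.completions.create(
    model="sql-lora",
    prompt="[user] Write a SQL query to answer the question based on the "
           "table schema.\n\n context: CREATE TABLE table_name_74 (icao "
           "VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for "
           "lilongwe international airport [/user] [assistant]",
    max_tokens=64,
    temperature=0,
)
print("Completion result:", completion.choices[0].text)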

1.15 AutoAWQ

Warning: Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better accuracy and higher throughput. Currently, you can use AWQ as a way to reduce the memory footprint. As of now, it is more suitable for low-latency inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.

To create a new 4-bit quantized model, you can leverage AutoAWQ. Quantizing reduces the model’s precision from
FP16 to INT4 which effectively reduces the file size by ~70%. The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the 400+ models on Huggingface.

$ pip install autoawq

After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize Vicuna 7B v1.5:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

To run an AWQ model with vLLM, you can use TheBloke/Llama-2-7b-Chat-AWQ with the following command:

$ python examples/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq

AWQ models are also supported directly through the LLM entrypoint:


from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

1.16 FP8 E5M2 KV Cache

The int8/int4 quantization scheme requires additional scale GPU memory storage, which reduces the expected GPU memory benefits. The FP8 data format retains 2~3 mantissa bits and can convert float/fp16/bfloat16 and fp8 to each other.
Here is an example of how to enable this feature:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8_e5m2")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


1.17 vLLM Engine

1.17.1 LLMEngine

1.17.2 AsyncLLMEngine

CHAPTER TWO

INDICES AND TABLES

• genindex
• modindex


PYTHON MODULE INDEX

v
    vllm.engine


INDEX

Symbols
--block-size (command line option)
--disable-log-stats (command line option)
--download-dir (command line option)
--dtype (command line option)
--gpu-memory-utilization (command line option)
--load-format (command line option)
--max-model-len (command line option)
--max-num-batched-tokens (command line option)
--max-num-seqs (command line option)
--max-paddings (command line option)
--max-parallel-loading-workers (command line option)
--model (command line option)
--pipeline-parallel-size (command line option)
--quantization (command line option)
--revision (command line option)
--seed (command line option)
--swap-space (command line option)
--tensor-parallel-size (command line option)
--tokenizer (command line option)
--tokenizer-mode (command line option)
--tokenizer-revision (command line option)
--trust-remote-code (command line option)
--worker-use-ray (command line option)

M
module
    vllm.engine

V
vllm.engine (module)