vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
• State-of-the-art serving throughput
• Efficient management of attention key and value memory with PagedAttention
• Continuous batching of incoming requests
• Fast model execution with CUDA/HIP graph
• Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
• Optimized CUDA kernels
vLLM is flexible and easy to use with:
• Seamless integration with popular HuggingFace models
• High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
• Tensor parallelism support for distributed inference
• Streaming outputs
• OpenAI-compatible API server
• Support for NVIDIA GPUs and AMD GPUs
• (Experimental) Prefix caching support
• (Experimental) Multi-LoRA support
For more information, check out the following:
• vLLM announcing blog post (intro to PagedAttention)
• vLLM paper (SOSP 2023)
• How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel
et al.
1.1 Installation
vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.
1.1.1 Requirements
• OS: Linux
• Python: 3.8 – 3.11
• GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
Note: As of now, vLLM's binaries are compiled with CUDA 12.1 by default. However, you can also install vLLM built with
CUDA 11.8 by installing the corresponding CUDA 11.8 wheel from the vLLM GitHub releases page.
Tip: If you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
Note: If you are developing the C++ backend of vLLM, consider building vLLM with python setup.py develop,
since it will give you incremental builds. The downside is that this method is deprecated by setuptools.
vLLM 0.2.4 onwards supports model inferencing and serving on AMD GPUs with ROCm. At the moment AWQ
quantization is not supported in ROCm, but SqueezeLLM quantization has been ported. Data types currently supported
in ROCm are FP16 and BF16.
1.2.1 Requirements
• OS: Linux
• Python: 3.8 – 3.11
• GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100)
• PyTorch 2.0.1 / 2.1.1 / 2.2
• ROCm 5.7 (Verified on python 3.10) or ROCm 6.0 (Verified on python 3.9)
Installation options:
1. (Recommended) Quick start with vLLM pre-installed in Docker Image
2. Build from source
3. Build from source with docker
1.2.2 (Recommended) Option 1: Quick start with vLLM pre-installed in Docker Image
Note:
• If you are using rocm5.7 with pytorch 2.1.0 onwards, you don't need to apply the hipify_python.patch. You can build the ROCm flash attention directly.
• If you fail to install ROCmSoftwarePlatform/flash-attention, try cloning from the commit 6fd2f8e572805681cd67ef8596c7e2ce521ed3c6.
• ROCm's Flash-attention-2 (v2.0.4) does not support sliding window attention.
• You might need to downgrade the "ninja" version to 1.10, as it is not used when compiling flash-attention-2 (e.g., pip install ninja==1.10.2.4).
2. Set up xformers==0.0.23 without dependencies, and apply patches to adapt it for ROCm flash attention.
3. Build vLLM:
$ cd vllm
$ pip install -U -r requirements-rocm.txt
$ python setup.py install  # This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
$ docker build -f Dockerfile.rocm -t vllm-rocm .
Alternatively, if you plan to install vLLM-ROCm on a local machine or start from a fresh docker image (e.g.
rocm/pytorch), you can follow the steps below:
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
• ROCm
• PyTorch
• hipBLAS
1. Install flash attention for ROCm
Install ROCm's flash attention (v2.0.4) following the instructions from ROCmSoftwarePlatform/flash-attention.
Note:
• If you are using rocm5.7 with pytorch 2.1.0 onwards, you don't need to apply the hipify_python.patch. You can build the ROCm flash attention directly.
• If you fail to install ROCmSoftwarePlatform/flash-attention, try cloning from the commit 6fd2f8e572805681cd67ef8596c7e2ce521ed3c6.
• ROCm's Flash-attention-2 (v2.0.4) does not support sliding window attention.
• You might need to downgrade the "ninja" version to 1.10, as it is not used when compiling flash-attention-2 (e.g., pip install ninja==1.10.2.4).
2. Set up xformers==0.0.23 without dependencies, and apply patches to adapt it for ROCm flash attention.
3. Build vLLM:
$ cd vllm
$ pip install -U -r requirements-rocm.txt
$ python setup.py install  # This may take 5-10 minutes.
Note:
• You may need to turn on the --enforce-eager flag if you experience a process hang when running the benchmark_throughput.py script to test your installation.
1.3 Quickstart
Note: By default, vLLM downloads models from HuggingFace. If you would like to use models from ModelScope in
the following examples, please set the environment variable:
export VLLM_USE_MODELSCOPE=True
We first show an example of using vLLM for offline batched inference on a dataset. In other words, we use vLLM to
generate texts for a list of input prompts.
Import LLM and SamplingParams from vLLM. The LLM class is the main class for running offline inference with
the vLLM engine. The SamplingParams class specifies the parameters for the sampling process.
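For reference, the corresponding import used by the snippets below:
from vllm import LLM, SamplingParams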
Define the list of input prompts and the sampling parameters for generation. The sampling temperature is set to 0.8 and
the nucleus sampling probability is set to 0.95. For more information about the sampling parameters, refer to the class
definition.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
Initialize vLLM’s engine for offline inference with the LLM class and the OPT-125M model. The list of supported
models can be found at supported models.
llm = LLM(model="facebook/opt-125m")
Call llm.generate to generate the outputs. It adds the input prompts to vLLM engine’s waiting queue and executes
the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of RequestOutput
objects, which include all the output tokens.
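A minimal sketch of this step, mirroring the AWQ example later in this document:
outputs = llm.generate(prompts, sampling_params)

# Print the prompts and generated texts.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")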
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in
replacement for applications using OpenAI API. By default, it starts the server at http://localhost:8000. You can
specify the address with --host and --port arguments. The server currently hosts one model at a time (OPT-125M
in the command below) and implements list models, create chat completion, and create completion endpoints. We are
actively adding support for more endpoints.
Start the server:
$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m
By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using
the --chat-template argument:
$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m \
$ --chat-template ./examples/template_chatml.jinja
This server can be queried in the same format as OpenAI API. For example, list the models:
$ curl http://localhost:8000/v1/models
You can pass in the argument --api-key or the environment variable VLLM_API_KEY to enable the server to check for
an API key in the request header.
$ curl http://localhost:8000/v1/completions \
$ -H "Content-Type: application/json" \
$ -d '{
$ "model": "facebook/opt-125m",
$ "prompt": "San Francisco is a",
$ "max_tokens": 7,
$ "temperature": 0
$ }'
Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application using
the OpenAI API. For example, another way to query the server is via the openai Python package:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="facebook/opt-125m",
                                        prompt="San Francisco is a")
print("Completion result:", completion)
The vLLM server is designed to support the OpenAI Chat API, allowing you to engage in dynamic conversations
with the model. The chat interface is a more interactive way to communicate with the model, allowing back-and-
forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed
explanations.
Querying the model using OpenAI Chat API:
You can use the create chat completion endpoint to communicate with the model in a chat-like interface:
$ curl http://localhost:8000/v1/chat/completions \
$ -H "Content-Type: application/json" \
$ -d '{
$ "model": "facebook/opt-125m",
$ "messages": [
$ {"role": "system", "content": "You are a helpful assistant."},
$ {"role": "user", "content": "Who won the world series in 2020?"}
$ ]
$ }'
from openai import OpenAI

# Reuse the same API key and base URL as in the completions example above.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="facebook/opt-125m",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a joke."},
]
)
print("Chat response:", chat_response)
For more in-depth examples and advanced features of the chat API, you can refer to the official OpenAI documentation.
vLLM supports distributed tensor-parallel inference and serving. Currently, we support Megatron-LM's tensor parallel
algorithm. We manage the distributed runtime with Ray. To run distributed inference, install Ray with:
$ pip install ray
To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs
you want to use. For example, to run inference on 4 GPUs:
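A minimal sketch of this setting (the model name is taken from the serving example below):
from vllm import LLM

# Shard the model across 4 GPUs via tensor parallelism.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")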
To run multi-GPU serving, pass in the --tensor-parallel-size argument when starting the server. For example,
to run API server on 4 GPUs:
$ python -m vllm.entrypoints.api_server \
$ --model facebook/opt-13b \
$ --tensor-parallel-size 4
To scale vLLM beyond a single machine, start a Ray runtime via CLI before running vLLM:
$ # On head node
$ ray start --head
$ # On worker nodes
$ ray start --address=<ray-head-address>
After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node
and setting tensor_parallel_size to the total number of GPUs across all machines.
vLLM can be run on the cloud to scale to multiple GPUs with SkyPilot, an open-source framework for running LLMs
on any cloud.
To install SkyPilot and set up your cloud credentials, run:
$ pip install skypilot
$ sky check
resources:
accelerators: A100
envs:
MODEL_NAME: decapoda-research/llama-13b-hf
TOKENIZER: hf-internal-testing/llama-tokenizer
setup: |
conda create -n vllm python=3.9 -y
run: |
conda activate vllm
echo 'Starting vllm api server...'
python -u -m vllm.entrypoints.api_server \
--model $MODEL_NAME \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--tokenizer $TOKENIZER 2>&1 | tee api_server.log &
echo 'Waiting for vllm api server to start...'
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
echo 'Starting gradio server...'
python vllm/examples/gradio_webserver.py
Check the output of the command. There will be a shareable gradio link (printed as the last line of the output). Open it in
your browser to use the LLaMA model to do the text completion.
Optional: Serve the 65B model instead of the default 13B model and use more GPUs by adjusting the accelerators and MODEL_NAME values in the YAML above.
vLLM can be deployed with KServe on Kubernetes for highly scalable distributed model serving.
Please see this guide for more details on using vLLM with KServe.
The Triton Inference Server hosts a tutorial demonstrating how to quickly deploy a simple facebook/opt-125m model
using vLLM. Please see Deploying a vLLM model in Triton for more details.
vLLM offers an official Docker image for deployment. The image can be used to run an OpenAI-compatible server and
is available on Docker Hub as vllm/vllm-openai.
Note: You can use either the --ipc=host flag or the --shm-size flag to allow the container to access the host's shared
memory. vLLM uses PyTorch, which uses shared memory to share data between processes under the hood, particularly
for tensor parallel inference.
You can build and run vLLM from source via the provided Dockerfile. To build vLLM:
Note: By default vLLM will build for all GPU types for widest distribution. If you are just building for the current GPU
type the machine is running on, you can add the argument --build-arg torch_cuda_arch_list="" for vLLM to
find the current GPU type and build for that.
To run vLLM:
To run inference on a single GPU or multiple GPUs, use the VLLM class from langchain:
from langchain.llms import VLLM  # in newer langchain versions: from langchain_community.llms import VLLM

llm = VLLM(model="mosaicml/mpt-7b",
           trust_remote_code=True,  # mandatory for hf models
           max_new_tokens=128,
)
vLLM exposes a number of metrics that can be used to monitor the health of the system. These metrics are exposed
via the /metrics endpoint on the vLLM OpenAI compatible API server.
The following metrics are exposed:
class Metrics:
self.info_cache_config = Info(
name='vllm:cache_config',
documentation='information of cache_config')
# System stats
self.gauge_scheduler_running = Gauge(
name="vllm:num_requests_running",
documentation="Number of requests currently running on GPU.",
labelnames=labelnames)
self.gauge_scheduler_swapped = Gauge(
name="vllm:num_requests_swapped",
documentation="Number of requests swapped to CPU.",
labelnames=labelnames)
self.gauge_scheduler_waiting = Gauge(
name="vllm:num_requests_waiting",
documentation="Number of requests waiting to be processed.",
labelnames=labelnames)
self.gauge_gpu_cache_usage = Gauge(
name="vllm:gpu_cache_usage_perc",
documentation="GPU KV-cache usage. 1 means 100 percent usage.",
labelnames=labelnames)
self.gauge_cpu_cache_usage = Gauge(
name="vllm:cpu_cache_usage_perc",
documentation="CPU KV-cache usage. 1 means 100 percent usage.",
labelnames=labelnames)
# Legacy metrics
self.gauge_avg_prompt_throughput = Gauge(
name="vllm:avg_prompt_throughput_toks_per_s",
documentation="Average prefill throughput in tokens/s.",
labelnames=labelnames,
)
self.gauge_avg_generation_throughput = Gauge(
name="vllm:avg_generation_throughput_toks_per_s",
documentation="Average generation throughput in tokens/s.",
labelnames=labelnames,
)
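These metrics are served in the Prometheus text format, so they can be scraped with any HTTP client. A minimal sketch, assuming the OpenAI-compatible server from the Quickstart is running on localhost:8000:
import requests

# Fetch the metrics endpoint and print only the vLLM-specific series.
response = requests.get("http://localhost:8000/metrics")
for line in response.text.splitlines():
    if line.startswith("vllm:"):
        print(line)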
vLLM supports a variety of generative Transformer models in HuggingFace Transformers. The following is the list
of model architectures that are currently supported by vLLM. Alongside each architecture, we include some popular
models that use it.
If your model uses one of the above model architectures, you can seamlessly run your model with vLLM. Otherwise,
please refer to Adding a New Model for instructions on how to implement support for your model. Alternatively, you
can raise an issue on our GitHub project.
Note: Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
Tip: The easiest way to check if your model is supported is to run the program below:
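A minimal sketch of such a check (substitute your own model name or path for the placeholder below):
from vllm import LLM

# If the architecture is supported, this loads the model and generates text;
# otherwise vLLM raises an error.
llm = LLM(model="facebook/opt-125m")  # replace with your model name or path
output = llm.generate("Hello, my name is")
print(output)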
Tip: To use models from ModelScope instead of HuggingFace Hub, set an environment variable:
$ export VLLM_USE_MODELSCOPE=True
This document provides a high-level guide on integrating a HuggingFace Transformers model into vLLM.
Note: The complexity of adding a new model depends heavily on the model's architecture. The process is considerably
easier if the model shares a similar architecture with an existing model in vLLM. However, for models that
include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
Tip: If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our GitHub
repository. We will be happy to help you out!
Start by forking our GitHub repository and then build it from source. This gives you the ability to modify the codebase
and test your model.
Clone the PyTorch model code from the HuggingFace Transformers repository and put it into the
vllm/model_executor/models directory. For instance, vLLM's OPT model was adapted from HuggingFace's
modeling_opt.py file.
Warning: When copying the model code, make sure to review and adhere to the code’s copyright and licensing
terms.
Next, you need to rewrite the forward methods of your model by following these steps:
1. Remove any unnecessary code, such as the code only used for training.
2. Change the input parameters:
def forward(
self,
input_ids: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- labels: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None,
- return_dict: Optional[bool] = None,
-) -> Union[Tuple, CausalLMOutputWithPast]:
+ positions: torch.Tensor,
+ kv_caches: List[KVCache],
+ input_metadata: InputMetadata,
+) -> Optional[SamplerOutput]:
3. Update the code by considering that input_ids and positions are now flattened tensors.
4. Replace the attention operation with either PagedAttention, PagedAttentionWithRoPE, or PagedAttentionWithALiBi depending on the model's architecture.
Note: Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional
embeddings. If your model employs a different attention mechanism, you will need to implement a new attention layer
in vLLM.
If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it. To do this, substitute
your model’s linear and embedding layers with their tensor-parallel versions. For the embedding layer, you can simply
replace nn.Embedding with VocabParallelEmbedding. For the output LM head, you can use ParallelLMHead.
When it comes to the linear layers, we provide the following options to parallelize them:
• ReplicatedLinear: Replicates the inputs and weights across multiple GPUs. No memory saving.
• RowParallelLinear: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An all-reduce operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
• ColumnParallelLinear: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
• MergedColumnParallelLinear: Column-parallel linear that merges multiple ColumnParallelLinear operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
• QKVParallelLinear: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When the number of key/value heads is less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
Note that all the linear layers above take linear_method as an input. vLLM will set this parameter according to different
quantization schemes to support weight quantization.
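To make the split concrete, here is a small self-contained PyTorch sketch of the column-/row-parallel arithmetic. It does not use vLLM's layer classes; the two "ranks" are simulated on one device purely to illustrate how the matrices are partitioned:
import torch

# Weights follow the torch.nn.Linear convention of shape (out_features, in_features).
torch.manual_seed(0)
x = torch.randn(4, 8)    # replicated input: batch of 4, hidden size 8
w1 = torch.randn(16, 8)  # first FFN layer weight (column-parallel)
w2 = torch.randn(8, 16)  # second FFN layer weight (row-parallel)

# ColumnParallelLinear: each rank holds a slice of w1 along the output dimension and
# produces the matching slice of the intermediate activation.
w1_shards = torch.chunk(w1, 2, dim=0)
h_shards = [x @ w.t() for w in w1_shards]

# RowParallelLinear: each rank holds a slice of w2 along the input dimension, consumes its
# activation slice, and the partial outputs are summed (the all-reduce step).
w2_shards = torch.chunk(w2, 2, dim=1)
y = sum(h @ w.t() for h, w in zip(h_shards, w2_shards))

# The sharded computation matches the unsharded FFN.
assert torch.allclose(y, (x @ w1.t()) @ w2.t(), atol=1e-5)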
You now need to implement the load_weights method in your *ForCausalLM class. This method should load the
weights from HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically,
for MergedColumnParallelLinear and QKVParallelLinear layers, if the original model has separated weight matrices,
you need to load the different parts separately.
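As an illustration of the separated-weight case, the following plain-PyTorch sketch (not vLLM's actual load_weights signature) copies separate q/k/v projection weights from a checkpoint-style dict into the corresponding slices of one merged QKV parameter:
import torch

hidden_size = 8

# Separated weights, as they would appear in a HuggingFace checkpoint.
checkpoint = {
    "q_proj.weight": torch.randn(hidden_size, hidden_size),
    "k_proj.weight": torch.randn(hidden_size, hidden_size),
    "v_proj.weight": torch.randn(hidden_size, hidden_size),
}

# Merged QKV weight, as used by a fused QKV linear layer (rows: q, then k, then v).
qkv_weight = torch.empty(3 * hidden_size, hidden_size)
offsets = {"q_proj.weight": 0, "k_proj.weight": 1, "v_proj.weight": 2}

# Load each separated part into its slice of the merged parameter.
for name, weight in checkpoint.items():
    idx = offsets[name]
    qkv_weight[idx * hidden_size:(idx + 1) * hidden_size].copy_(weight)

assert torch.equal(qkv_weight[:hidden_size], checkpoint["q_proj.weight"])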
Below, you can find an explanation of every engine argument for vLLM:
--model <model_name_or_path>
Name or path of the huggingface model to use.
--tokenizer <tokenizer_name_or_path>
Name or path of the huggingface tokenizer to use.
--revision <revision>
The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use
the default version.
--tokenizer-revision <revision>
The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will
use the default version.
--tokenizer-mode {auto,slow}
The tokenizer mode.
• “auto” will use the fast tokenizer if available.
• “slow” will always use the slow tokenizer.
--trust-remote-code
Trust remote code from huggingface.
--download-dir <directory>
Directory to download and load the weights, default to the default cache dir of huggingface.
--load-format {auto,pt,safetensors,npcache,dummy}
The format of the model weights to load.
• “auto” will try to load the weights in the safetensors format and fall back to the pytorch bin format if
safetensors format is not available.
• “pt” will load the weights in the pytorch bin format.
• “safetensors” will load the weights in the safetensors format.
• “npcache” will load the weights in pytorch format and store a numpy cache to speed up the loading.
• “dummy” will initialize the weights with random values, mainly for profiling.
--dtype {auto,half,float16,bfloat16,float,float32}
Data type for model weights and activations.
• “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
• “half” for FP16. Recommended for AWQ quantization.
• “float16” is the same as “half”.
• “bfloat16” for a balance between precision and range.
• “float” is shorthand for FP32 precision.
• “float32” for FP32 precision.
--max-model-len <length>
Model context length. If unspecified, will be automatically derived from the model config.
--worker-use-ray
Use Ray for distributed serving, will be automatically set when using more than 1 GPU.
--pipeline-parallel-size (-pp) <size>
Number of pipeline stages.
--tensor-parallel-size (-tp) <size>
Number of tensor parallel replicas.
--max-parallel-loading-workers <workers>
Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models.
--block-size {8,16,32}
Token block size for contiguous chunks of tokens.
--seed <seed>
Random seed for operations.
--swap-space <size>
CPU swap space size (GiB) per GPU.
--gpu-memory-utilization <fraction>
The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a
value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
--max-num-batched-tokens <tokens>
Maximum number of batched tokens per iteration.
--max-num-seqs <sequences>
Maximum number of sequences per iteration.
--max-paddings <paddings>
Maximum number of paddings in a batch.
--disable-log-stats
Disable logging statistics.
--quantization (-q) {awq,squeezellm,None}
Method used to quantize the weights.
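Many of these arguments are also exposed as keyword arguments of the LLM class for offline inference; a minimal sketch (the exact set of supported keywords may vary by version, and the values below are illustrative):
from vllm import LLM

# Roughly equivalent to `--dtype half --gpu-memory-utilization 0.9 --seed 0`
# on the server command line.
llm = LLM(
    model="facebook/opt-125m",
    dtype="half",
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    seed=0,
)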
This document shows you how to use LoRA adapters with vLLM on top of a base model. Adapters can be efficiently
served on a per-request basis with minimal overhead. First we download the adapter(s) and save them locally with:
from huggingface_hub import snapshot_download

sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
Then we instantiate the base model and pass in the enable_lora=True flag:
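A minimal sketch of this step (the base model name matches the server example later in this section; the LoRARequest import path is assumed and may differ by version):
from vllm import LLM
from vllm.lora.request import LoRARequest  # import path assumed; may differ by version

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)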
We can now submit the prompts and call llm.generate with the lora_request parameter. The first parameter of
LoRARequest is a human-identifiable name, the second parameter is a globally unique ID for the adapter, and the third
parameter is the path to the LoRA adapter.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
)

prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name ... [/user] [assistant]",
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]
outputs = llm.generate(
prompts,
sampling_params,
lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)
Check out examples/multilora_inference.py for an example of how to use LoRA adapters with the async engine and
how to use more advanced configuration options.
LoRA adapted models can also be served with the OpenAI-compatible vLLM server. To do so, we use
--lora-modules {name}={path} {name}={path} to specify each LoRA module when we kick off the server:
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
The server entrypoint accepts all other LoRA configuration parameters (max_loras, max_lora_rank,
max_cpu_loras, etc.), which will apply to all forthcoming requests. Upon querying the /models endpoint,
we should see our LoRA along with its base model:
curl localhost:8000/v1/models | jq .
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-2-7b-hf",
"object": "model",
...
},
{
"id": "sql-lora",
"object": "model",
...
}
    ]
}
Requests can specify the LoRA adapter as if it were any other model via the model request parameter. The requests
will be processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and
potentially other LoRA adapter requests if they were provided and max_loras is set high enough).
The following is an example request.
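As a sketch, using the openai client configured earlier and addressing the adapter by its registered name (the prompt and parameters below are illustrative):
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# The LoRA adapter registered via --lora-modules is selected through the model field.
completion = client.completions.create(
    model="sql-lora",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)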
1.15 AutoAWQ
Warning: Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend
using the unquantized version of the model for better accuracy and higher throughput. Currently, you can use AWQ
as a way to reduce memory footprint. As of now, it is more suitable for low-latency inference with a small number of
concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
To create a new 4-bit quantized model, you can leverage AutoAWQ. Quantizing reduces the model’s precision from
FP16 to INT4 which effectively reduces the file size by ~70%. The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the 400+ models on Huggingface.
After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize Vicuna 7B v1.5:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
To run an AWQ model with vLLM, you can use TheBloke/Llama-2-7b-Chat-AWQ and pass the --quantization awq flag documented above.
AWQ models are also supported directly through the LLM entrypoint:
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The int8/int4 quantization scheme requires additional scale GPU memory storage, which reduces the expected GPU
memory benefits. The FP8 data format retains 2~3 mantissa bits and can convert float/fp16/bfloat16 and fp8 to each
other.
Here is an example of how to enable this feature:
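A minimal sketch, assuming your vLLM version exposes the kv_cache_dtype option (e.g., the "fp8_e5m2" value) on the LLM class:
from vllm import LLM, SamplingParams

# kv_cache_dtype="fp8_e5m2" is assumed to be available; it stores the KV cache in FP8.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8_e5m2")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)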
1.17.1 LLMEngine
1.17.2 AsyncLLMEngine