vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
• State-of-the-art serving throughput
• Efficient management of attention key and value memory with PagedAttention
• Continuous batching of incoming requests
• Fast model execution with CUDA/HIP graph
• Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
• Optimized CUDA kernels
vLLM is flexible and easy to use with:
• Seamless integration with popular HuggingFace models
• High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
• Tensor parallelism support for distributed inference
• Streaming outputs
• OpenAI-compatible API server
• Support for NVIDIA GPUs and AMD GPUs
• (Experimental) Prefix caching support
• (Experimental) Multi-LoRA support
For more information, check out the following:
• vLLM announcing blog post (intro to PagedAttention)
• vLLM paper (SOSP 2023)
• How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel
et al.
1.1 Installation
vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.
1.1.1 Requirements
• OS: Linux
• Python: 3.8 – 3.11
• GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
Note: As of now, vLLM's binaries are compiled with CUDA 12.1 by default. However, you can also install vLLM built with
CUDA 11.8 by installing the corresponding CUDA 11.8 wheel from the vLLM GitHub releases page.
Tip: If you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
Note: If you are developing the C++ backend of vLLM, consider building vLLM with python setup.py develop,
since it will give you incremental builds. The downside is that this method is deprecated by setuptools.
vLLM 0.2.4 onwards supports model inferencing and serving on AMD GPUs with ROCm. At the moment AWQ
quantization is not supported in ROCm, but SqueezeLLM quantization has been ported. Data types currently supported
in ROCm are FP16 and BF16.
1.2.1 Requirements
• OS: Linux
• Python: 3.8 – 3.11
• GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100)
• PyTorch 2.0.1 / 2.1.1 / 2.2
• ROCm 5.7 (Verified on python 3.10) or ROCm 6.0 (Verified on python 3.9)
Installation options:
1. (Recommended) Quick start with vLLM pre-installed in Docker Image
2. Build from source
3. Build from source with docker
1.2.2 (Recommended) Option 1: Quick start with vLLM pre-installed in Docker Image
Note:
• If you are using rocm5.7 with pytorch 2.1.0 onwards, you don't need to apply the hipify_python.patch. You can build the ROCm flash attention directly.
• If you fail to install ROCmSoftwarePlatform/flash-attention, try cloning from the commit 6fd2f8e572805681cd67ef8596c7e2ce521ed3c6.
• ROCm's Flash-attention-2 (v2.0.4) does not support sliding window attention.
• You might need to downgrade the "ninja" version to 1.10, as it is not used when compiling flash-attention-2 (e.g., pip install ninja==1.10.2.4).
2. Set up xformers==0.0.23 without dependencies, and apply patches to adapt it for ROCm flash attention.
3. Build vLLM:
$ cd vllm
$ pip install -U -r requirements-rocm.txt
$ python setup.py install  # This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
$ docker build -f Dockerfile.rocm -t vllm-rocm .
Alternatively, if you plan to install vLLM-ROCm on a local machine or start from a fresh docker image (e.g.
rocm/pytorch), you can follow the steps below:
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
• ROCm
• PyTorch
• hipBLAS
1. Install flash attention for ROCm
Install ROCm's flash attention (v2.0.4) following the instructions from ROCmSoftwarePlatform/flash-attention.
Note:
• If you are using rocm5.7 with pytorch 2.1.0 onwards, you don't need to apply the hipify_python.patch. You can build the ROCm flash attention directly.
• If you fail to install ROCmSoftwarePlatform/flash-attention, try cloning from the commit 6fd2f8e572805681cd67ef8596c7e2ce521ed3c6.
• ROCm's Flash-attention-2 (v2.0.4) does not support sliding window attention.
• You might need to downgrade the "ninja" version to 1.10, as it is not used when compiling flash-attention-2 (e.g., pip install ninja==1.10.2.4).
2. Set up xformers==0.0.23 without dependencies, and apply patches to adapt it for ROCm flash attention.
3. Build vLLM:
$ cd vllm
$ pip install -U -r requirements-rocm.txt
$ python setup.py install  # This may take 5-10 minutes.
Note:
• You may need to turn on the --enforce-eager flag if you experience a process hang when running the benchmark_throughput.py script to test your installation.
1.3 Quickstart
Note: By default, vLLM downloads models from HuggingFace. If you would like to use models from ModelScope in
the following examples, please set the environment variable:
export VLLM_USE_MODELSCOPE=True
We first show an example of using vLLM for offline batched inference on a dataset. In other words, we use vLLM to
generate texts for a list of input prompts.
Import LLM and SamplingParams from vLLM. The LLM class is the main class for running offline inference with
the vLLM engine. The SamplingParams class specifies the parameters for the sampling process.
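For reference, the corresponding import used by the snippets below:
from vllm import LLM, SamplingParams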
Define the list of input prompts and the sampling parameters for generation. The sampling temperature is set to 0.8 and
the nucleus sampling probability is set to 0.95. For more information about the sampling parameters, refer to the class
definition.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
Initialize vLLM’s engine for offline inference with the LLM class and the OPT-125M model. The list of supported
models can be found at supported models.
llm = LLM(model="facebook/opt-125m")
Call llm.generate to generate the outputs. It adds the input prompts to vLLM engine’s waiting queue and executes
the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of RequestOutput
objects, which include all the output tokens.
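A minimal sketch of this step, mirroring the AWQ example later in this document:
outputs = llm.generate(prompts, sampling_params)

# Print the prompts and generated texts.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")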
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in
replacement for applications using OpenAI API. By default, it starts the server at http://localhost:8000. You can
specify the address with --host and --port arguments. The server currently hosts one model at a time (OPT-125M
in the command below) and implements list models, create chat completion, and create completion endpoints. We are
actively adding support for more endpoints.
Start the server:
$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m
By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using
the --chat-template argument:
$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m \
$ --chat-template ./examples/template_chatml.jinja
This server can be queried in the same format as OpenAI API. For example, list the models:
$ curl http://localhost:8000/v1/models
You can pass in the argument --api-key or the environment variable VLLM_API_KEY to enable the server to check for
an API key in the request header.
$ curl http://localhost:8000/v1/completions \
$ -H "Content-Type: application/json" \
$ -d '{
$ "model": "facebook/opt-125m",
$ "prompt": "San Francisco is a",
$ "max_tokens": 7,
$ "temperature": 0
$ }'
Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application using
the OpenAI API. For example, another way to query the server is via the openai Python package:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="facebook/opt-125m",
                                        prompt="San Francisco is a")
print("Completion result:", completion)
The vLLM server is designed to support the OpenAI Chat API, allowing you to engage in dynamic conversations
with the model. The chat interface is a more interactive way to communicate with the model, allowing back-and-
forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed
explanations.
Querying the model using OpenAI Chat API:
You can use the create chat completion endpoint to communicate with the model in a chat-like interface:
$ curl http://localhost:8000/v1/chat/completions \
$ -H "Content-Type: application/json" \
$ -d '{
$ "model": "facebook/opt-125m",
$ "messages": [
$ {"role": "system", "content": "You are a helpful assistant."},
$ {"role": "user", "content": "Who won the world series in 2020?"}
$ ]
$ }'
from openai import OpenAI

# Reuse the same API key and base URL as in the completions example above.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="facebook/opt-125m",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a joke."},
]
)
print("Chat response:", chat_response)
For more in-depth examples and advanced features of the chat API, you can refer to the official OpenAI documentation.
vLLM supports distributed tensor-parallel inference and serving. Currently, we support Megatron-LM's tensor parallel
algorithm. We manage the distributed runtime with Ray. To run distributed inference, install Ray with:
$ pip install ray
To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs
you want to use. For example, to run inference on 4 GPUs:
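A minimal sketch of this setting (the model name is taken from the serving example below):
from vllm import LLM

# Shard the model across 4 GPUs via tensor parallelism.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")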
To run multi-GPU serving, pass in the --tensor-parallel-size argument when starting the server. For example,
to run API server on 4 GPUs:
$ python -m vllm.entrypoints.api_server \
$ --model facebook/opt-13b \
$ --tensor-parallel-size 4
To scale vLLM beyond a single machine, start a Ray runtime via CLI before running vLLM:
$ # On head node
$ ray start --head
$ # On worker nodes
$ ray start --address=<ray-head-address>
After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node
and setting tensor_parallel_size to the total number of GPUs across all machines.
vLLM can be run on the cloud to scale to multiple GPUs with SkyPilot, an open-source framework for running LLMs
on any cloud.
To install SkyPilot and set up your cloud credentials, run:
$ pip install skypilot
$ sky check
resources:
accelerators: A100
envs:
MODEL_NAME: decapoda-research/llama-13b-hf
TOKENIZER: hf-internal-testing/llama-tokenizer
setup: |
conda create -n vllm python=3.9 -y
run: |
conda activate vllm
echo 'Starting vllm api server...'
python -u -m vllm.entrypoints.api_server \
--model $MODEL_NAME \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--tokenizer $TOKENIZER 2>&1 | tee api_server.log &
echo 'Waiting for vllm api server to start...'
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
echo 'Starting gradio server...'
python vllm/examples/gradio_webserver.py
Check the output of the command. There will be a shareable gradio link (printed as the last line of the output). Open it in
your browser to use the LLaMA model to do the text completion.
Optional: Serve the 65B model instead of the default 13B model and use more GPUs by adjusting the accelerators and MODEL_NAME values in the YAML above.
vLLM can be deployed with KServe on Kubernetes for highly scalable distributed model serving.
Please see this guide for more details on using vLLM with KServe.
The Triton Inference Server hosts a tutorial demonstrating how to quickly deploy a simple facebook/opt-125m model
using vLLM. Please see Deploying a vLLM model in Triton for more details.
vLLM offers an official Docker image for deployment. The image can be used to run an OpenAI-compatible server and
is available on Docker Hub as vllm/vllm-openai.
Note: You can use either the --ipc=host flag or the --shm-size flag to allow the container to access the host's shared
memory. vLLM uses PyTorch, which uses shared memory to share data between processes under the hood, particularly
for tensor parallel inference.
You can build and run vLLM from source via the provided Dockerfile. To build vLLM:
Note: By default vLLM will build for all GPU types for widest distribution. If you are just building for the current GPU
type the machine is running on, you can add the argument --build-arg torch_cuda_arch_list="" for vLLM to
find the current GPU type and build for that.
To run vLLM:
To run inference on a single GPU or multiple GPUs, use the VLLM class from langchain:
from langchain.llms import VLLM  # in newer langchain versions: from langchain_community.llms import VLLM

llm = VLLM(model="mosaicml/mpt-7b",
           trust_remote_code=True,  # mandatory for hf models
           max_new_tokens=128,
)
vLLM exposes a number of metrics that can be used to monitor the health of the system. These metrics are exposed
via the /metrics endpoint on the vLLM OpenAI compatible API server.
The following metrics are exposed:
class Metrics:
self.info_cache_config = Info(
name='vllm:cache_config',
documentation='information of cache_config')
# System stats
self.gauge_scheduler_running = Gauge(
name="vllm:num_requests_running",
documentation="Number of requests currently running on GPU.",
labelnames=labelnames)
self.gauge_scheduler_swapped = Gauge(
name="vllm:num_requests_swapped",
documentation="Number of requests swapped to CPU.",
labelnames=labelnames)
self.gauge_scheduler_waiting = Gauge(
name="vllm:num_requests_waiting",
documentation="Number of requests waiting to be processed.",
labelnames=labelnames)
self.gauge_gpu_cache_usage = Gauge(
name="vllm:gpu_cache_usage_perc",
documentation="GPU KV-cache usage. 1 means 100 percent usage.",
labelnames=labelnames)
self.gauge_cpu_cache_usage = Gauge(
name="vllm:cpu_cache_usage_perc",
documentation="CPU KV-cache usage. 1 means 100 percent usage.",
labelnames=labelnames)
# Legacy metrics
self.gauge_avg_prompt_throughput = Gauge(
name="vllm:avg_prompt_throughput_toks_per_s",
documentation="Average prefill throughput in tokens/s.",
labelnames=labelnames,
)
self.gauge_avg_generation_throughput = Gauge(
name="vllm:avg_generation_throughput_toks_per_s",
documentation="Average generation throughput in tokens/s.",
labelnames=labelnames,
)
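These metrics are served in the Prometheus text format, so they can be scraped with any HTTP client. A minimal sketch, assuming the OpenAI-compatible server from the Quickstart is running on localhost:8000:
import requests

# Fetch the metrics endpoint and print only the vLLM-specific series.
response = requests.get("http://localhost:8000/metrics")
for line in response.text.splitlines():
    if line.startswith("vllm:"):
        print(line)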
vLLM supports a variety of generative Transformer models in HuggingFace Transformers. The following is the list
of model architectures that are currently supported by vLLM. Alongside each architecture, we include some popular
models that use it.
If your model uses one of the above model architectures, you can seamlessly run your model with vLLM. Otherwise,
please refer to Adding a New Model for instructions on how to implement support for your model. Alternatively, you
can raise an issue on our GitHub project.
Note: Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
Tip: The easiest way to check if your model is supported is to run the program below:
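A minimal sketch of such a check (substitute your own model name or path for the placeholder below):
from vllm import LLM

# If the architecture is supported, this loads the model and generates text;
# otherwise vLLM raises an error.
llm = LLM(model="facebook/opt-125m")  # replace with your model name or path
output = llm.generate("Hello, my name is")
print(output)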
Tip: To use models from ModelScope instead of HuggingFace Hub, set an environment variable:
$ export VLLM_USE_MODELSCOPE=True
This document provides a high-level guide on integrating a HuggingFace Transformers model into vLLM.
Note: The complexity of adding a new model depends heavily on the model's architecture. The process is considerably
easier if the model shares a similar architecture with an existing model in vLLM. However, for models that
include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
Tip: If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our GitHub
repository. We will be happy to help you out!
Start by forking our GitHub repository and then build it from source. This gives you the ability to modify the codebase
and test your model.
Clone the PyTorch model code from the HuggingFace Transformers repository and put it into the
vllm/model_executor/models directory. For instance, vLLM's OPT model was adapted from HuggingFace's
modeling_opt.py file.
Warning: When copying the model code, make sure to review and adhere to the code’s copyright and licensing
terms.
Next, you need to rewrite the forward methods of your model by following these steps:
1. Remove any unnecessary code, such as the code only used for training.
2. Change the input parameters:
def forward(
self,
input_ids: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- labels: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None,
- return_dict: Optional[bool] = None,
-) -> Union[Tuple, CausalLMOutputWithPast]:
+ positions: torch.Tensor,
+ kv_caches: List[KVCache],
+ input_metadata: InputMetadata,
+) -> Optional[SamplerOutput]:
3. Update the code by considering that input_ids and positions are now flattened tensors.
4. Replace the attention operation with either PagedAttention, PagedAttentionWithRoPE, or PagedAttentionWithALiBi depending on the model's architecture.
Note: Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional
embeddings. If your model employs a different attention mechanism, you will need to implement a new attention layer
in vLLM.
If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it. To do this, substitute
your model’s linear and embedding layers with their tensor-parallel versions. For the embedding layer, you can simply
replace nn.Embedding with VocabParallelEmbedding. For the output LM head, you can use ParallelLMHead.
When it comes to the linear layers, we provide the following options to parallelize them:
• ReplicatedLinear: Replicates the inputs and weights across multiple GPUs. No memory saving.
• RowParallelLinear: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An all-reduce operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
• ColumnParallelLinear: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
• MergedColumnParallelLinear: Column-parallel linear that merges multiple ColumnParallelLinear operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
• QKVParallelLinear: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When the number of key/value heads is less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
Note that all the linear layers above take linear_method as an input. vLLM will set this parameter according to different
quantization schemes to support weight quantization.
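To make the split concrete, here is a small self-contained PyTorch sketch of the column-/row-parallel arithmetic. It does not use vLLM's layer classes; the two "ranks" are simulated on one device purely to illustrate how the matrices are partitioned:
import torch

# Weights follow the torch.nn.Linear convention of shape (out_features, in_features).
torch.manual_seed(0)
x = torch.randn(4, 8)    # replicated input: batch of 4, hidden size 8
w1 = torch.randn(16, 8)  # first FFN layer weight (column-parallel)
w2 = torch.randn(8, 16)  # second FFN layer weight (row-parallel)

# ColumnParallelLinear: each rank holds a slice of w1 along the output dimension and
# produces the matching slice of the intermediate activation.
w1_shards = torch.chunk(w1, 2, dim=0)
h_shards = [x @ w.t() for w in w1_shards]

# RowParallelLinear: each rank holds a slice of w2 along the input dimension, consumes its
# activation slice, and the partial outputs are summed (the all-reduce step).
w2_shards = torch.chunk(w2, 2, dim=1)
y = sum(h @ w.t() for h, w in zip(h_shards, w2_shards))

# The sharded computation matches the unsharded FFN.
assert torch.allclose(y, (x @ w1.t()) @ w2.t(), atol=1e-5)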
You now need to implement the load_weights method in your *ForCausalLM class. This method should load the
weights from HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically,
for MergedColumnParallelLinear and QKVParallelLinear layers, if the original model has separated weight matrices,
you need to load the different parts separately.
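As an illustration of the separated-weight case, the following plain-PyTorch sketch (not vLLM's actual load_weights signature) copies separate q/k/v projection weights from a checkpoint-style dict into the corresponding slices of one merged QKV parameter:
import torch

hidden_size = 8

# Separated weights, as they would appear in a HuggingFace checkpoint.
checkpoint = {
    "q_proj.weight": torch.randn(hidden_size, hidden_size),
    "k_proj.weight": torch.randn(hidden_size, hidden_size),
    "v_proj.weight": torch.randn(hidden_size, hidden_size),
}

# Merged QKV weight, as used by a fused QKV linear layer (rows: q, then k, then v).
qkv_weight = torch.empty(3 * hidden_size, hidden_size)
offsets = {"q_proj.weight": 0, "k_proj.weight": 1, "v_proj.weight": 2}

# Load each separated part into its slice of the merged parameter.
for name, weight in checkpoint.items():
    idx = offsets[name]
    qkv_weight[idx * hidden_size:(idx + 1) * hidden_size].copy_(weight)

assert torch.equal(qkv_weight[:hidden_size], checkpoint["q_proj.weight"])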
Below, you can find an explanation of every engine argument for vLLM:
--model <model_name_or_path>
Name or path of the huggingface model to use.
--tokenizer <tokenizer_name_or_path>
Name or path of the huggingface tokenizer to use.
--revision <revision>
The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use
the default version.
--tokenizer-revision <revision>
The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will
use the default version.
--tokenizer-mode {auto,slow}
The tokenizer mode.
• “auto” will use the fast tokenizer if available.
• “slow” will always use the slow tokenizer.
--trust-remote-code
Trust remote code from huggingface.
--download-dir <directory>
Directory to download and load the weights, default to the default cache dir of huggingface.
--load-format {auto,pt,safetensors,npcache,dummy}
The format of the model weights to load.
• “auto” will try to load the weights in the safetensors format and fall back to the pytorch bin format if
safetensors format is not available.
• “pt” will load the weights in the pytorch bin format.
• “safetensors” will load the weights in the safetensors format.
• “npcache” will load the weights in pytorch format and store a numpy cache to speed up the loading.
• “dummy” will initialize the weights with random values, mainly for profiling.
--dtype {auto,half,float16,bfloat16,float,float32}
Data type for model weights and activations.
• “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
• “half” for FP16. Recommended for AWQ quantization.
• “float16” is the same as “half”.
• “bfloat16” for a balance between precision and range.
• “float” is shorthand for FP32 precision.
• “float32” for FP32 precision.
--max-model-len <length>
Model context length. If unspecified, will be automatically derived from the model config.
--worker-use-ray
Use Ray for distributed serving, will be automatically set when using more than 1 GPU.
--pipeline-parallel-size (-pp) <size>
Number of pipeline stages.
--tensor-parallel-size (-tp) <size>
Number of tensor parallel replicas.
--max-parallel-loading-workers <workers>
Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models.
--block-size {8,16,32}
Token block size for contiguous chunks of tokens.
--seed <seed>
Random seed for operations.
--swap-space <size>
CPU swap space size (GiB) per GPU.
--gpu-memory-utilization <fraction>
The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a
value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
--max-num-batched-tokens <tokens>
Maximum number of batched tokens per iteration.
--max-num-seqs <sequences>
Maximum number of sequences per iteration.
--max-paddings <paddings>
Maximum number of paddings in a batch.
--disable-log-stats
Disable logging statistics.
--quantization (-q) {awq,squeezellm,None}
Method used to quantize the weights.
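Many of these arguments are also exposed as keyword arguments of the LLM class for offline inference; a minimal sketch (the exact set of supported keywords may vary by version, and the values below are illustrative):
from vllm import LLM

# Roughly equivalent to `--dtype half --gpu-memory-utilization 0.9 --seed 0`
# on the server command line.
llm = LLM(
    model="facebook/opt-125m",
    dtype="half",
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    seed=0,
)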
This document shows you how to use LoRA adapters with vLLM on top of a base model. Adapters can be efficiently
served on a per-request basis with minimal overhead. First we download the adapter(s) and save them locally with:
from huggingface_hub import snapshot_download

sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
Then we instantiate the base model and pass in the enable_lora=True flag:
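A minimal sketch of this step (the base model name matches the server example later in this section; the LoRARequest import path is assumed and may differ by version):
from vllm import LLM
from vllm.lora.request import LoRARequest  # import path assumed; may differ by version

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)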
We can now submit the prompts and call llm.generate with the lora_request parameter. The first parameter of
LoRARequest is a human-identifiable name, the second parameter is a globally unique ID for the adapter, and the third
parameter is the path to the LoRA adapter.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
)

prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name ... [/user] [assistant]",
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]
outputs = llm.generate(
prompts,
sampling_params,
lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)
Check out examples/multilora_inference.py for an example of how to use LoRA adapters with the async engine and
how to use more advanced configuration options.
LoRA adapted models can also be served with the OpenAI-compatible vLLM server. To do so, we use
--lora-modules {name}={path} {name}={path} to specify each LoRA module when we kick off the server:
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
The server entrypoint accepts all other LoRA configuration parameters (max_loras, max_lora_rank,
max_cpu_loras, etc.), which will apply to all forthcoming requests. Upon querying the /models endpoint,
we should see our LoRA along with its base model:
curl localhost:8000/v1/models | jq .
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-2-7b-hf",
"object": "model",
...
},
{
"id": "sql-lora",
"object": "model",
...
}
    ]
}
Requests can specify the LoRA adapter as if it were any other model via the model request parameter. The requests
will be processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and
potentially other LoRA adapter requests if they were provided and max_loras is set high enough).
The following is an example request.
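As a sketch, using the openai client configured earlier and addressing the adapter by its registered name (the prompt and parameters below are illustrative):
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# The LoRA adapter registered via --lora-modules is selected through the model field.
completion = client.completions.create(
    model="sql-lora",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)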
1.15 AutoAWQ
Warning: Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend
using the unquantized version of the model for better accuracy and higher throughput. Currently, you can use AWQ
as a way to reduce memory footprint. As of now, it is more suitable for low-latency inference with a small number of
concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
To create a new 4-bit quantized model, you can leverage AutoAWQ. Quantizing reduces the model’s precision from
FP16 to INT4 which effectively reduces the file size by ~70%. The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the 400+ models on Huggingface.
After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize Vicuna 7B v1.5:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
To run an AWQ model with vLLM, you can use TheBloke/Llama-2-7b-Chat-AWQ and pass the --quantization awq flag documented above.
AWQ models are also supported directly through the LLM entrypoint:
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The int8/int4 quantization scheme requires additional scale GPU memory storage, which reduces the expected GPU
memory benefits. The FP8 data format retains 2~3 mantissa bits and can convert float/fp16/bfloat16 and fp8 to each
other.
Here is an example of how to enable this feature:
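A minimal sketch, assuming your vLLM version exposes the kv_cache_dtype option (e.g., the "fp8_e5m2" value) on the LLM class:
from vllm import LLM, SamplingParams

# kv_cache_dtype="fp8_e5m2" is assumed to be available; it stores the KV cache in FP8.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8_e5m2")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)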
1.17.1 LLMEngine
1.17.2 AsyncLLMEngine