
Optimizing an Inference Server for Maximum Tokens/Second
Since your inference server will only be queried by one person at a time, we can focus on
maximizing raw inference performance rather than handling concurrent requests. Here are the
key strategies to achieve the highest tokens per second for your 70B parameter model within a
$2,000 budget.

Hardware Recommendations

GPU Selection
For a sub-$2,000 budget, your best option is the NVIDIA RTX 4090 (24GB VRAM):
Offers an excellent performance/price ratio among consumer cards
24GB of VRAM fits a 70B model only when it is aggressively quantized (roughly 2-3 bits per weight, or 4-bit with some layers offloaded to system RAM)
Significantly more affordable than data center GPUs like the A100, and far faster than entry-level inference cards like the T4 [1]

System Configuration
CPU: AMD Ryzen 9 7950X or Intel i9-13900K (sufficient for supporting the GPU)
RAM: 64GB DDR5 (minimum for handling 70B parameter models)
Storage: 2TB NVMe SSD (for model storage and fast loading)
Power Supply: 1000W Gold/Platinum rated (to handle GPU power requirements)

Software Optimization Techniques

1. Quantization
Since maximizing tokens/second is your priority, aggressive quantization is essential:
Implement 4-bit quantization (AWQ or GPTQ) to reduce memory requirements
This can provide up to 1.7x speedup compared to 8-bit quantization with minimal accuracy loss [2]

# Example of loading with 4-bit quantization
# (this uses bitsandbytes NF4 on-the-fly quantization; AWQ/GPTQ instead load
# pre-quantized checkpoints, as shown in the sketch below)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16  # run compute in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b",
    quantization_config=quantization_config,
    device_map="auto"                     # place layers on GPU, spill to CPU if needed
)
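
If you go the AWQ route instead (which pairs well with the vLLM deployment shown later), you typically load a pre-quantized checkpoint rather than quantizing on the fly. A minimal sketch, assuming the autoawq package is installed and using a community checkpoint such as TheBloke/Llama-2-70B-AWQ as a stand-in; substitute whichever quantized weights you actually use:

# Loading a pre-quantized AWQ checkpoint with transformers
# (assumes `pip install autoawq` and a checkpoint that already contains AWQ weights)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-70B-AWQ"  # example checkpoint; replace with your own

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"   # layers that do not fit in VRAM are placed in CPU RAM, at a speed cost
)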

2. Inference Framework Selection


For single-user, maximum throughput scenarios, these frameworks perform best:
vLLM: Consistently delivers high decoding speed and is well-supported [3]
LMDeploy: Can achieve up to 4000 tokens per second in optimal conditions [3]
TensorRT-LLM: Excellent for NVIDIA GPUs with comprehensive optimizations [2]

3. KV Cache Optimization
Implement efficient KV cache management to maximize throughput:
Use PagedAttention (as implemented in vLLM) for memory-efficient KV cache
If your use case involves repetitive prompts, implement prefix-caching strategies [2]
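
As a concrete example of the second point, vLLM exposes prefix caching as a constructor flag on its offline LLM API. A minimal sketch, assuming the same hypothetical 4-bit AWQ checkpoint path used in the deployment step later and a recent vLLM release that supports the flag:

# Reusing the KV cache for a shared prompt prefix with vLLM
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-2-70b-4bit",
    quantization="awq",
    gpu_memory_utilization=0.95,
    enable_prefix_caching=True,   # cache KV blocks for repeated prefixes
)

system_prefix = "You are a helpful assistant. Answer concisely.\n\n"
params = SamplingParams(max_tokens=256)

# The second call reuses the cached KV blocks for `system_prefix`,
# so only the new suffix has to be prefilled.
llm.generate(system_prefix + "What is PagedAttention?", params)
llm.generate(system_prefix + "How does AWQ quantization work?", params)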

4. Model Serving Configuration


Set up NVIDIA Triton Inference Server with these optimizations:
Implement dynamic batching even for single-user scenarios (batching tokens, not requests) [4]
Configure operator fusion to improve latency [5]
Use TensorRT model optimization to potentially double throughput [5]
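
Once Triton is up (see the deployment steps below), you can sanity-check the endpoint from Python with the tritonclient package before benchmarking. A minimal sketch, assuming `pip install tritonclient[http]` and a model registered under the hypothetical name llama-2-70b in your model repository:

# Quick health check against a running Triton server
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

print("Server live:", client.is_server_live())
print("Server ready:", client.is_server_ready())
print("Model ready:", client.is_model_ready("llama-2-70b"))  # hypothetical model name

# Inspect the model's expected inputs/outputs before sending requests
print(client.get_model_metadata("llama-2-70b"))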

Implementation Steps
1. Install Ubuntu 22.04 LTS with the latest NVIDIA drivers
2. Set up CUDA environment:

sudo apt install -y nvidia-driver-535
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
sudo sh cuda_12.2.0_535.54.03_linux.run

3. Install Docker and NVIDIA Container Toolkit:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker

4. Deploy vLLM for maximum performance:

# Pull and run vLLM container


docker run --gpus all --shm-size 1g -p 8000:8000 -v /path/to/models:/models ghcr.io/vllm-

# Start vLLM server with optimized settings


python -m vllm.entrypoints.api_server \
--model /models/llama-2-70b-4bit \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95 \
--max-model-len 8192 \
--quantization awq

5. Alternative: Use TensorRT-LLM for potentially higher performance:

# Build TensorRT engine
# (convert the HF checkpoint to TensorRT-LLM format first, using the
#  convert_checkpoint.py script from the TensorRT-LLM Llama examples)
trtllm-build --checkpoint_dir /models/llama-2-70b \
    --output_dir /models/llama-2-70b-engine \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 512

# Run inference server (use the -trtllm-python-py3 image variant, which
# bundles the TensorRT-LLM backend needed to serve the built engine)
docker run --gpus all -it -p 8000:8000 -v /models:/models \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 tritonserver \
--model-repository=/models

Performance Monitoring and Tuning


1. Monitor GPU utilization:

nvidia-smi dmon -s u
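
If you prefer to log utilization programmatically (for example, alongside your throughput numbers), the NVML Python bindings expose the same counters. A minimal sketch, assuming `pip install nvidia-ml-py`, which provides the pynvml module:

# Log GPU and VRAM utilization once per second via NVML
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {util.gpu}% | VRAM {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB")
    time.sleep(1)

pynvml.nvmlShutdown()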

2. Measure tokens per second:

import time
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b")
prompt = "Write a detailed essay about artificial intelligence."

start_time = time.time()
response = requests.post("http://localhost:8000/generate",
                         json={"prompt": prompt, "max_tokens": 1000})
end_time = time.time()

# Adjust this line to your server's response schema; vLLM's api_server
# returns the generated text under the "text" key.
output = response.json()["text"]
if isinstance(output, list):   # some servers return a list of completions
    output = output[0]

tokens = len(tokenizer.encode(output))
# Note: this measures end-to-end latency (prompt prefill + decoding),
# so it slightly understates pure decoding speed.
tokens_per_second = tokens / (end_time - start_time)
print(f"Tokens per second: {tokens_per_second:.1f}")

3. Fine-tune parameters based on your specific model and hardware:
Adjust tensor-parallelism settings if needed
Experiment with different quantization methods (a simple comparison harness is sketched below)
Optimize prompt processing
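
To make that experimentation repeatable, wrap the throughput measurement in a small script and point it at one quantized build per run. A minimal sketch using vLLM's offline API; the checkpoint paths and the set of quantization methods are assumptions to adapt to whatever builds you actually have:

# Compare decoding throughput of different quantized builds
# (run one method per invocation to avoid holding two 70B models in GPU memory)
import sys
import time
from vllm import LLM, SamplingParams

# Hypothetical local checkpoints; replace with your own quantized builds.
CONFIGS = {
    "awq":  "/models/llama-2-70b-awq",
    "gptq": "/models/llama-2-70b-gptq",
}

method = sys.argv[1]   # e.g. `python bench_quant.py awq`
llm = LLM(model=CONFIGS[method], quantization=method,
          gpu_memory_utilization=0.95, max_model_len=8192)

params = SamplingParams(max_tokens=512)
prompt = "Write a detailed essay about artificial intelligence."

start = time.time()
outputs = llm.generate(prompt, params)
elapsed = time.time() - start

generated = len(outputs[0].outputs[0].token_ids)   # tokens actually produced
print(f"{method}: {generated / elapsed:.1f} tokens/second")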
By implementing these strategies, you should be able to achieve the maximum possible tokens
per second for a 70B parameter model on consumer hardware within your $2,000 budget.

1. https://www.trgdatacenters.com/resource/gpu-for-inference/
2. https://tensorfuse.io/blog/llm-throughput-vllm-vs-sglang
3. https://www.bentoml.com/blog/benchmarking-llm-inference-backends
4. https://aws.amazon.com/blogs/machine-learning/achieve-hyperscale-performance-for-model-serving-using-nvidia-triton-inference-server-on-amazon-sagemaker/
5. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/optimization.html
