
Manual model conversion on GPU

This article introduces the manual workflow for converting LLM models using a local Nvidia GPU. It describes the required environment setup, execution steps, and how to run inference on a Windows Copilot+ PC with a Qualcomm NPU.

Converting LLM models requires an Nvidia GPU. If you want model lab to manage your local GPU, follow the steps in Convert Model. Otherwise, follow the steps in this article.

Run model conversion on GPU manually

This workflow is configured using the qnn_config.json file and requires two separate Python environments.

  • The first environment is used for model conversion with GPU acceleration and includes packages like onnxruntime-gpu and AutoGPTQ.
  • The second environment is used for QNN optimization and includes onnxruntime-qnn, with its dependencies installed separately.

First environment setup

In an x64 Python 3.10 environment with Olive installed, install the required packages:

# Install common dependencies
pip install -r requirements.txt

# Install ONNX Runtime GPU packages
pip install "onnxruntime-gpu>=1.21.0" "onnxruntime-genai-cuda>=0.6.0"

# AutoGPTQ: Install from source (stable package may be slow for weight packing)
# Disable CUDA extension build (not required)
# Linux
export BUILD_CUDA_EXT=0
# Windows
# set BUILD_CUDA_EXT=0

# Install AutoGPTQ from source
pip install --no-build-isolation git+https://github.com/PanQiWei/AutoGPTQ.git

# Install PyTorch with CUDA support; update the CUDA version (cu121) if needed
pip install torch --index-url https://download.pytorch.org/whl/cu121

⚠️ Only set up the environment and install the packages. Do not run the olive run command at this point.

Second environment setup

In an x64 Python 3.10 environment with Olive installed, install the required packages:

# Install ONNX Runtime dependencies
pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt

# Install the nightly onnxruntime-qnn package without its default dependencies
pip install -U --pre --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps

Replace /path/to/qnn/env/bin in qnn_config.json with the path to the directory containing the second environment's Python executable.
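
The exact layout of qnn_config.json depends on the sample, but the second environment is typically referenced through a Python-environment system entry. The following is only an orienting sketch (the system name qnn_system and the accelerator settings are illustrative, not the sample's exact content):

"systems": {
  "qnn_system": {
    "type": "PythonEnvironment",
    "python_environment_path": "/path/to/qnn/env/bin",
    "accelerators": [ { "device": "npu", "execution_providers": [ "QNNExecutionProvider" ] } ]
  }
}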

Run the config

Activate the first environment and run the workflow:

olive run --config qnn_config.json

After the command completes, the optimized model is saved to ./model/model_name.

⚠️ If optimization fails with an out-of-memory error, remove the calibration_providers setting from the config file.
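
Which pass carries this setting depends on the sample's config. As an illustration only (the exact pass entry and provider value are assumptions), the line to delete looks similar to this:

"calibration_providers": [ "CUDAExecutionProvider" ]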

⚠️ If optimization fails during context binary generation, rerun the command. The process will resume from the last completed step.

Run inference samples manually

The optimized model can be used for inference with the ONNX Runtime QNN Execution Provider and ONNX Runtime GenAI. Inference must be run on a Windows Copilot+ PC with a Qualcomm NPU.

Install required packages in an arm64 Python environment

Model compilation with the QNN Execution Provider requires a Python environment with onnxruntime-qnn installed. In a separate arm64 Python environment with Olive installed, install the required packages:

pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt
pip install -U --pre --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
pip install "onnxruntime-genai>=0.7.0rc2"

Run the inference sample

Run the provided inference_sample.ipynb notebook, selecting the arm64 Python environment as the ipykernel.
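
If you prefer a plain script over the notebook, a minimal token-streaming loop with ONNX Runtime GenAI could look like the sketch below. The prompt and search options are placeholders, and the call names reflect recent onnxruntime-genai releases, so check them against the installed version:

import onnxruntime_genai as og

# Load the optimized model produced by the conversion workflow
model = og.Model("./model/model_name")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Generation settings; the values here are placeholders
params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is the capital of France?"))

# Generate and print tokens as they are produced
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()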

⚠️ If you get a 6033 error, replace genai_config.json in the ./model/model_name folder.