A lightweight HTTP server for running Qwen and GPT-OSS language models with MLX (Apple's machine learning framework) acceleration on Apple silicon. The server exposes an OpenAI-compatible API for serving these models locally.

- OpenAI-compatible API: Supports `/v1/chat/completions` and `/v1/completions` endpoints
- MLX acceleration: Leverages Apple's MLX framework for fast inference on Apple Silicon
- Speculative decoding: Supports draft models for faster generation
- Prompt caching: Efficiently reuses common prompt prefixes
- Tool calling support: Native support for function calling with custom formats (see the sketch after this list)
- Streaming responses: Real-time token streaming support
- Model adapters: Support for fine-tuned model adapters
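
To illustrate the tool calling support, here is a minimal sketch that sends a function definition through the `openai` Python client. It assumes the server is running at the default 127.0.0.1:8080 and accepts the standard OpenAI `tools` format; the `get_weather` tool and the model name are placeholders for illustration.

```python
from openai import OpenAI  # pip install openai

# The server is OpenAI-compatible, so the standard client works; the API key is unused.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

# A single hypothetical tool the model may decide to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model chose to call the tool, the call shows up here.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

Because the API is OpenAI-compatible, any client that can target a custom base URL should work the same way.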
```bash
# Install the package
uv sync
```

Start the server with a Qwen model:

```bash
# Basic usage
uv run main.py --type qwen --model <path-to-qwen-model>
# With custom host and port
uv run main.py --type qwen --host 0.0.0.0 --port 8080 --model <path-to-qwen-model>
# With draft model for speculative decoding
uv run main.py --type qwen --model <path-to-qwen-model> --draft-model <path-to-draft-model>
```

### POST /v1/chat/completions

Example request:

```json
{
"model": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"stream": false
}
```

### POST /v1/completions

Example request:

```json
{
"model": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",
"prompt": "Hello, my name is",
"stream": false
}
```

### GET /health

### GET /v1/models

Command-line options:

- `--model`: Path to the MLX model weights, tokenizer, and config
- `--adapter-path`: Optional path for trained adapter weights and config
- `--host`: Host for the HTTP server (default: 127.0.0.1)
- `--port`: Port for the HTTP server (default: 8080)
- `--draft-model`: Model to be used for speculative decoding
- `--num-draft-tokens`: Number of tokens to draft when using speculative decoding
- `--trust-remote-code`: Enable trusting remote code for the tokenizer
- `--log-level`: Set the logging level (default: INFO)
- `--chat-template`: Specify a chat template for the tokenizer
- `--use-default-chat-template`: Use the default chat template
- `--temp`: Default sampling temperature (default: 0.0)
- `--top-p`: Default nucleus sampling top-p (default: 1.0)
- `--top-k`: Default top-k sampling (default: 0, disables top-k)
- `--min-p`: Default min-p sampling (default: 0.0, disables min-p)
- `--max-tokens`: Default maximum number of tokens to generate (default: 512)
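
To verify a running server, here is a minimal sketch that exercises the two GET endpoints with Python's standard library. It assumes the default host and port and that `/v1/models` returns the usual OpenAI-style list with a `data` array.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8080"  # assumes the default --host/--port

# Health check: a 200 response means the server is up.
with urllib.request.urlopen(f"{BASE}/health") as resp:
    print("health:", resp.status)

# List the models the server exposes (assumed OpenAI-style response shape).
with urllib.request.urlopen(f"{BASE}/v1/models") as resp:
    body = json.load(resp)
    for model in body.get("data", []):
        print("model:", model.get("id"))
```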
Example requests with curl:

```bash
# Chat completion
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-7B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"stream": false
}'
```

```bash
# Streaming chat completion
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a short story about a robot learning to paint."}
],
"stream": true
}'
```
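
The streaming request above can also be issued from Python. Below is a minimal sketch using the `openai` client package pointed at the local server; the base URL assumes the default port, and the model name should match whatever the server has loaded.

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at the local server; the API key is unused.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short story about a robot learning to paint."},
    ],
    stream=True,
)

# Print tokens as they arrive.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```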
Example configuration:

```json
{
"$schema": "https://opencode.ai/config.json",
"share": "disabled",
"provider": {
"mlx-lm": {
"npm": "@ai-sdk/openai-compatible",
"name": "mlx-lm (local)",
"options": {
"baseURL": "http://127.0.0.1:28100/v1"
},
"models": {
"mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ": {
"name": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",
"options": {
"max_tokens": 128000,
},
"tools": true
}
}
}
}
}
```

```bash
# Run the server in development mode
uv run main.py --model <path-to-model>
```

To contribute:

- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This project is built on top of:
- MLX - Apple's machine learning framework for Apple silicon
- mlx-lm - MLX Language Model Inference
- Hugging Face Transformers