A lightweight HTTP server for running Qwen and GPT-OSS language models with MLX (Apple's machine learning framework) acceleration on Apple silicon. The server exposes an OpenAI-compatible API for serving these models locally.

- OpenAI-compatible API: Supports `/v1/chat/completions` and `/v1/completions` endpoints
- MLX acceleration: Leverages Apple's MLX framework for fast inference on Apple Silicon
- Speculative decoding: Supports draft models for faster generation
- Prompt caching: Efficiently reuses common prompt prefixes
- Tool calling support: Native support for function calling with custom formats (see the sketch after this list)
- Streaming responses: Real-time token streaming support
- Model adapters: Support for fine-tuned model adapters
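
To illustrate the tool calling support, here is a minimal sketch that sends a function definition through the `openai` Python client. It assumes the server is running at the default 127.0.0.1:8080 and accepts the standard OpenAI `tools` format; the `get_weather` tool and the model name are placeholders for illustration.

```python
from openai import OpenAI  # pip install openai

# The server is OpenAI-compatible, so the standard client works; the API key is unused.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

# A single hypothetical tool the model may decide to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model chose to call the tool, the call shows up here.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

Because the API is OpenAI-compatible, any client that can target a custom base URL should work the same way.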
```bash
# Install the package
uv sync
```

Start the server with a Qwen model:

```bash
# Basic usage
uv run main.py --type qwen --model <path-to-qwen-model>
# With custom host and port
uv run main.py --type qwen --host 0.0.0.0 --port 8080 --model <path-to-qwen-model>
# With draft model for speculative decoding
uv run main.py --type qwen --model <path-to-qwen-model> --draft-model <path-to-draft-model>
```

### POST /v1/chat/completions

Example request:

```json
{
"model": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"stream": false
}
```

### POST /v1/completions

Example request:

```json
{
"model": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",
"prompt": "Hello, my name is",
"stream": false
}
```

### GET /health

### GET /v1/models

Command-line options:

- `--model`: Path to the MLX model weights, tokenizer, and config
- `--adapter-path`: Optional path for trained adapter weights and config
- `--host`: Host for the HTTP server (default: 127.0.0.1)
- `--port`: Port for the HTTP server (default: 8080)
- `--draft-model`: Model to be used for speculative decoding
- `--num-draft-tokens`: Number of tokens to draft when using speculative decoding
- `--trust-remote-code`: Enable trusting remote code for the tokenizer
- `--log-level`: Set the logging level (default: INFO)
- `--chat-template`: Specify a chat template for the tokenizer
- `--use-default-chat-template`: Use the default chat template
- `--temp`: Default sampling temperature (default: 0.0)
- `--top-p`: Default nucleus sampling top-p (default: 1.0)
- `--top-k`: Default top-k sampling (default: 0, disables top-k)
- `--min-p`: Default min-p sampling (default: 0.0, disables min-p)
- `--max-tokens`: Default maximum number of tokens to generate (default: 512)
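
To verify a running server, here is a minimal sketch that exercises the two GET endpoints with Python's standard library. It assumes the default host and port and that `/v1/models` returns the usual OpenAI-style list with a `data` array.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8080"  # assumes the default --host/--port

# Health check: a 200 response means the server is up.
with urllib.request.urlopen(f"{BASE}/health") as resp:
    print("health:", resp.status)

# List the models the server exposes (assumed OpenAI-style response shape).
with urllib.request.urlopen(f"{BASE}/v1/models") as resp:
    body = json.load(resp)
    for model in body.get("data", []):
        print("model:", model.get("id"))
```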
Example requests with curl:

```bash
# Chat completion
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-7B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"stream": false
}'
```

```bash
# Streaming chat completion
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a short story about a robot learning to paint."}
],
"stream": true
}'
```
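
The streaming request above can also be issued from Python. Below is a minimal sketch using the `openai` client package pointed at the local server; the base URL assumes the default port, and the model name should match whatever the server has loaded.

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at the local server; the API key is unused.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short story about a robot learning to paint."},
    ],
    stream=True,
)

# Print tokens as they arrive.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```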
Example configuration:

```json
{
"$schema": "https://opencode.ai/config.json",
"share": "disabled",
"provider": {
"mlx-lm": {
"npm": "@ai-sdk/openai-compatible",
"name": "mlx-lm (local)",
"options": {
"baseURL": "http://127.0.0.1:28100/v1"
},
"models": {
"mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ": {
"name": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ",
"options": {
"max_tokens": 128000,
},
"tools": true
}
}
}
}
}
```

```bash
# Run the server in development mode
uv run main.py --model <path-to-model>
```

To contribute:

- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This project is built on top of:
- MLX - Apple's machine learning framework for Apple silicon
- mlx-lm - MLX Language Model Inference
- Hugging Face Transformers