
Deploying PyTorch Models with TorchServe

Last Updated : 24 Jul, 2025

TorchServe is an open-source model serving framework specifically designed for PyTorch models. Developed through a collaboration between Facebook (now Meta) and AWS, it enables efficient, scalable and production-ready deployment of machine learning models by bridging the gap between model development and real-world applications. TorchServe's architecture is composed of three main components: the Frontend, the Process Orchestrator and the Backend, all working together to serve machine learning models efficiently.

Figure: TorchServe Architecture

The Frontend manages incoming API requests (inference and management), optionally batches them and routes them through model-specific threads and endpoints. The Process Orchestrator handles communication and coordination between the frontend and backend, dynamically managing model workers based on load. The Backend runs the actual model inferences in isolated worker processes, each linked to a specific model and handler script. Models are loaded from a centralized Model Store, ensuring scalability, performance isolation and flexibility in deployment.

Why Use TorchServe?

Deploying advanced machine learning models often involves complexities such as infrastructure setup, REST API management, scaling and monitoring. TorchServe addresses these pain points by offering:

  • Quick deployment without extensive engineering overhead.
  • RESTful endpoints for model inference.
  • Model versioning and dynamic management.
  • Built-in monitoring and logging for production reliability.
  • Support for scalable deployment, including in cloud or containerized environments.

Key Features of TorchServe

Figure: Key Features of TorchServe
  • Dynamic Model Management: Load, unload and update models without restarting the server.
  • Multi-Model Serving: Serve multiple models and versions concurrently on a single instance.
  • Custom Handlers: Integrate custom Python scripts for pre-processing and post-processing.
  • Monitoring and Metrics: Expose Prometheus metrics, health checks and detailed logs.
  • Batch Inference and Resource Control: Automatically batches requests and allows fine-grained resource allocation.
  • Cloud and Container Support: Native compatibility with major cloud platforms and Docker/Kubernetes for horizontal scaling.

TorchServe APIs and Ports

TorchServe exposes three main APIs, each serving a distinct function:

API Type   | Port | Function
Inference  | 8080 | Send prediction/inference requests, receive model outputs
Management | 8081 | Register, unregister and list models, configure workers and resources
Metrics    | 8082 | Real-time model metrics: request/response stats, error tracking, health

Models are generally packaged as .mar (Model Archive) files, which bundle the model weights, configuration and handler code needed for deployment.

Step-by-Step Guide to Deploy PyTorch Models with TorchServe

1. Install TorchServe and Dependencies

Shell
pip install torchserve torch-model-archiver


Alternatively, you can use Docker images (pytorch/torchserve:latest-gpu or latest-cpu).

2. Prepare Your Model

Export your trained PyTorch model as a serialized file (model.pt or similar). Write a handler script (handler.py) if custom pre/post-processing logic is necessary.
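
For example, a trained model can be exported as a TorchScript archive so TorchServe can load it without access to the original model class. The snippet below is a minimal sketch assuming a pretrained torchvision ResNet-18; the file name is illustrative.

Python
import torch
import torchvision.models as models

# Load a trained model (a pretrained ResNet-18 here, purely for illustration)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Export as TorchScript so TorchServe can load the weights without the model class
scripted_model = torch.jit.script(model)
scripted_model.save("model.pt")

If none of the built-in handlers (image_classifier, text_classifier, object_detector, image_segmenter) fit your use case, a custom handler can subclass TorchServe's BaseHandler. The sketch below only overrides pre- and post-processing; the exact tensor handling depends on your input format and is purely illustrative.

Python
# my_handler.py -- minimal custom handler sketch
import torch
from ts.torch_handler.base_handler import BaseHandler

class MyHandler(BaseHandler):
    def preprocess(self, data):
        # data is a list of requests; each carries its payload under "data" or "body"
        # (the conversion below assumes numeric payloads and is illustrative only)
        tensors = [torch.as_tensor(row.get("data") or row.get("body")) for row in data]
        return torch.stack(tensors).float()

    def postprocess(self, inference_output):
        # Return one JSON-serializable result per request in the batch
        return inference_output.argmax(dim=1).tolist()

When the serialized file is already a TorchScript archive, the --model-file argument shown in the next step can usually be omitted.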

3. Archive Your Model

Use the model archiver to bundle your model, handler and any extra files into a .mar archive.

Shell
torch-model-archiver \
  --model-name my_model \
  --version 1.0 \
  --model-file model.py \
  --serialized-file model.pt \
  --handler my_handler.py \
  --export-path model_store

The exported .mar file will reside in the model_store directory.

4. Launch TorchServe

Start TorchServe locally or in a container:

Local Example:

Shell
torchserve --start --ncs --model-store model_store --models my_model.mar

Docker Example:

Shell
docker run --rm -it --gpus all \
  -p 8080:8080 -p 8081:8081 \
  -v $(pwd)/model_store:/model-store \
  pytorch/torchserve:latest-gpu \
  torchserve --model-store /model-store --models my_model.mar

This maps the inference (8080) and management (8081) ports so the server can accept HTTP requests.

5. Register and Manage Models

TorchServe offers CLI and RESTful API options to register new models, update versions and remove outdated models on the fly without restarting the service. This ensures high uptime and simple model lifecycle management.
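
As a rough illustration, the management API (port 8081 by default) can also be driven from Python with the requests library; the model and archive names below are the ones used earlier and the parameter values are illustrative.

Python
import requests

MANAGEMENT = "http://localhost:8081"

# Register a model from the model store and start one worker for it
requests.post(f"{MANAGEMENT}/models", params={"url": "my_model.mar", "initial_workers": 1})

# List currently registered models
print(requests.get(f"{MANAGEMENT}/models").json())

# Scale the number of workers serving the model
requests.put(f"{MANAGEMENT}/models/my_model", params={"min_worker": 2})

# Unregister the model when it is no longer needed
requests.delete(f"{MANAGEMENT}/models/my_model")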

6. Sending Inference Requests

With the model deployed, you can send requests to the inference API (default port 8080):

Shell
curl -X POST http://localhost:8080/predictions/my_model -T input_data.json

The API responds with inference results based on your model's output logic.
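
The same call can be made from Python, which is often more convenient inside an application. This is a minimal sketch mirroring the curl command above; it assumes the handler returns a JSON response.

Python
import requests

# Send a prediction request to the inference API (default port 8080)
with open("input_data.json", "rb") as f:
    response = requests.post("http://localhost:8080/predictions/my_model", data=f)

print(response.status_code, response.json())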

7. Monitoring and Logging

Metrics are exposed on port 8082, compatible with Prometheus and other monitoring tools. TorchServe supports detailed logging, health checks and performance analysis via built-in endpoints and log files.
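
For a quick check, the health and metrics endpoints can be polled directly; the ports below are TorchServe's defaults.

Python
import requests

# Liveness check served on the inference port
print(requests.get("http://localhost:8080/ping").json())  # e.g. {"status": "Healthy"}

# Prometheus-format metrics served on the metrics port
print(requests.get("http://localhost:8082/metrics").text[:500])  # first few metric lines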

Advanced TorchServe Features

  • Multi-Version Model Serving: Serve several versions for A/B testing or rollback.
  • Dynamic Batching: Improve throughput by batching inference requests (see the registration sketch after this list).
  • Custom Workflows and Handlers: Integrate business logic or non-standard input formats.
  • Scalability: Integrate with orchestration frameworks like Kubernetes for autoscaling.
  • TorchScript and Eager Mode Support: Compatible with both TorchScripted and eager PyTorch models.
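
For instance, dynamic batching is configured per model at registration time through the management API; the batch_size and max_batch_delay values below are illustrative. The handler's preprocess method then receives up to batch_size requests at once.

Python
import requests

# Register the model with dynamic batching: up to 8 requests are grouped together,
# waiting at most 50 ms before a partial batch is dispatched to a worker
requests.post(
    "http://localhost:8081/models",
    params={
        "url": "my_model.mar",
        "batch_size": 8,
        "max_batch_delay": 50,
        "initial_workers": 1,
    },
)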

Deployment in Cloud and Edge Environments

TorchServe works seamlessly on:

  • AWS SageMaker, Azure Machine Learning and Google Cloud Vertex AI using custom containers.
  • Any self-managed infrastructure or Kubernetes cluster, ensuring portability and scalability.
