Create an Elasticsearch inference endpoint | Elasticsearch API documentation (v8)

Create an Elasticsearch inference endpoint Generally available; Added in 8.13.0

PUT /_inference/{task_type}/{elasticsearch_inference_id}

Create an inference endpoint to perform an inference task with the elasticsearch service.

Your Elasticsearch deployment contains preconfigured ELSER and E5 inference endpoints, you only need to create the enpoints using the API if you want to customize the settings.

If you use the ELSER or the E5 model through the elasticsearch service, the API request will automatically download and deploy the model if it isn't downloaded yet.

You might see a 502 bad gateway error in the response when using the Kibana Console. This error usually just reflects a timeout, while the model downloads in the background. You can check the download progress in the Machine Learning UI. If using the Python client, you can set the timeout parameter to a higher value.

After creating the endpoint, wait for the model deployment to complete before using it. To verify the deployment status, use the get trained model statistics API. Look for "state": "fully_allocated" in the response and ensure that the "allocation_count" matches the "target_allocation_count". Avoid creating multiple endpoints for the same model unless required, as each endpoint consumes significant resources.

Required authorization

Cluster privileges: manage_inference

Path parameters

task_type string

The type of the inference task that the model will perform.

Values are rerank, sparse_embedding, or text_embedding.
elasticsearch_inference_id string Required

The unique identifier of the inference endpoint. The must not match the model_id.

Query parameters

timeout string

Specifies the amount of time to wait for the inference endpoint to be created.

External documentation

application/json

Body Required

chunking_settings object

The chunking configuration object.

External documentation
Hide chunking_settings attributes Show chunking_settings attributes object
- max_chunk_size number
  
  The maximum size of a chunk in words. This value cannot be higher than 300 or lower than 20 (for sentence strategy) or 10 (for word strategy).
  
  Default value is 250.
- overlap number
  
  The number of overlapping words for chunks. It is applicable only to a word chunking strategy. This value cannot be higher than half the max_chunk_size value.
  
  Default value is 100.
- sentence_overlap number
  
  The number of overlapping sentences for chunks. It is applicable only for a sentence chunking strategy. It can be either 1 or 0.
  
  Default value is 1.
- strategy string
  
  The chunking strategy: sentence or word.
  
  Default value is sentence.
service string Required

The type of service supported for the specified task type. In this case, elasticsearch.

Value is elasticsearch.
service_settings object Required

Settings used to install the inference model. These settings are specific to the elasticsearch service.
Hide service_settings attributes Show service_settings attributes object
- adaptive_allocations object
  
  Adaptive allocations configuration details. If enabled is true, the number of allocations of the model is set based on the current load the process gets. When the load is high, a new model allocation is automatically created, respecting the value of max_number_of_allocations if it's set. When the load is low, a model allocation is automatically removed, respecting the value of min_number_of_allocations if it's set. If enabled is true, do not set the number of allocations manually.
  Hide adaptive_allocations attributes Show adaptive_allocations attributes object
  
  enabled boolean
  
  Turn on adaptive_allocations.
  
  Default value is false.
  
  max_number_of_allocations number
  
  The maximum number of allocations to scale to. If set, it must be greater than or equal to min_number_of_allocations.
  
  min_number_of_allocations number
  
  The minimum number of allocations to scale to. If set, it must be greater than or equal to 0. If not defined, the deployment scales to 0.
- deployment_id string
  
  The deployment identifier for a trained model deployment. When deployment_id is used the model_id is optional.
- model_id string Required
  
  The name of the model to use for the inference task. It can be the ID of a built-in model (for example, .multilingual-e5-small for E5) or a text embedding model that was uploaded by using the Eland client.
  
  External documentation
- num_allocations number
  
  The total number of allocations that are assigned to the model across machine learning nodes. Increasing this value generally increases the throughput. If adaptive allocations are enabled, do not set this value because it's automatically set.
- num_threads number Required
  
  The number of threads used by each model allocation during inference. This setting generally increases the speed per inference request. The inference process is a compute-bound process; threads_per_allocations must not exceed the number of available allocated processors per node. The value must be a power of 2. The maximum value is 32.
task_settings object

Settings to configure the inference task. These settings are specific to the task type you specified.
Hide task_settings attribute Show task_settings attribute object
- return_documents boolean
  
  For a rerank task, return the document instead of only the index.
  
  Default value is true.

Responses

200 application/json
Hide response attributes Show response attributes object
- chunking_settings object
  
  Chunking configuration object
  
  Hide chunking_settings attributes Show chunking_settings attributes object
  
  max_chunk_size number
  
  The maximum size of a chunk in words. This value cannot be higher than 300 or lower than 20 (for sentence strategy) or 10 (for word strategy).
  
  Default value is 250.
  
  overlap number
  
  The number of overlapping words for chunks. It is applicable only to a word chunking strategy. This value cannot be higher than half the max_chunk_size value.
  
  Default value is 100.
  
  sentence_overlap number
  
  The number of overlapping sentences for chunks. It is applicable only for a sentence chunking strategy. It can be either 1 or 0.
  
  Default value is 1.
  
  strategy string
  
  The chunking strategy: sentence or word.
  
  Default value is sentence.
- service string Required
  
  The service type
- service_settings object Required
  
  Settings specific to the service
- task_settings object
  
  Task settings specific to the service and task type
- inference_id string Required
  
  The inference Id
- task_type string Required
  
  The task type
  
  Values are sparse_embedding, text_embedding, or rerank.

PUT /_inference/{task_type}/{elasticsearch_inference_id}

PUT _inference/sparse_embedding/my-elser-model
{
    "service": "elasticsearch",
    "service_settings": {
        "adaptive_allocations": { 
        "enabled": true,
        "min_number_of_allocations": 1,
        "max_number_of_allocations": 4
        },
        "num_threads": 1,
        "model_id": ".elser_model_2" 
    }
}

resp = client.inference.put(
    task_type="sparse_embedding",
    inference_id="my-elser-model",
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "adaptive_allocations": {
                "enabled": True,
                "min_number_of_allocations": 1,
                "max_number_of_allocations": 4
            },
            "num_threads": 1,
            "model_id": ".elser_model_2"
        }
    },
)

const response = await client.inference.put({
  task_type: "sparse_embedding",
  inference_id: "my-elser-model",
  inference_config: {
    service: "elasticsearch",
    service_settings: {
      adaptive_allocations: {
        enabled: true,
        min_number_of_allocations: 1,
        max_number_of_allocations: 4,
      },
      num_threads: 1,
      model_id: ".elser_model_2",
    },
  },
});

response = client.inference.put(
  task_type: "sparse_embedding",
  inference_id: "my-elser-model",
  body: {
    "service": "elasticsearch",
    "service_settings": {
      "adaptive_allocations": {
        "enabled": true,
        "min_number_of_allocations": 1,
        "max_number_of_allocations": 4
      },
      "num_threads": 1,
      "model_id": ".elser_model_2"
    }
  }
)

$resp = $client->inference()->put([
    "task_type" => "sparse_embedding",
    "inference_id" => "my-elser-model",
    "body" => [
        "service" => "elasticsearch",
        "service_settings" => [
            "adaptive_allocations" => [
                "enabled" => true,
                "min_number_of_allocations" => 1,
                "max_number_of_allocations" => 4,
            ],
            "num_threads" => 1,
            "model_id" => ".elser_model_2",
        ],
    ],
]);

curl -X PUT -H "Authorization: ApiKey $ELASTIC_API_KEY" -H "Content-Type: application/json" -d '{"service":"elasticsearch","service_settings":{"adaptive_allocations":{"enabled":true,"min_number_of_allocations":1,"max_number_of_allocations":4},"num_threads":1,"model_id":".elser_model_2"}}' "$ELASTICSEARCH_URL/_inference/sparse_embedding/my-elser-model"

client.inference().put(p -> p
    .inferenceId("my-elser-model")
    .taskType(TaskType.SparseEmbedding)
    .inferenceConfig(i -> i
        .service("elasticsearch")
        .serviceSettings(JsonData.fromJson("{\"adaptive_allocations\":{\"enabled\":true,\"min_number_of_allocations\":1,\"max_number_of_allocations\":4},\"num_threads\":1,\"model_id\":\".elser_model_2\"}"))
    )
);

Request examples

Run `PUT _inference/sparse_embedding/my-elser-model` to create an inference endpoint that performs a `sparse_embedding` task. The `model_id` must be the ID of one of the built-in ELSER models. The API will automatically download the ELSER model if it isn't already downloaded and then deploy the model.

{
    "service": "elasticsearch",
    "service_settings": {
        "adaptive_allocations": { 
        "enabled": true,
        "min_number_of_allocations": 1,
        "max_number_of_allocations": 4
        },
        "num_threads": 1,
        "model_id": ".elser_model_2" 
    }
}

Run `PUT _inference/rerank/my-elastic-rerank` to create an inference endpoint that performs a rerank task using the built-in Elastic Rerank cross-encoder model. The `model_id` must be `.rerank-v1`, which is the ID of the built-in Elastic Rerank model. The API will automatically download the Elastic Rerank model if it isn't already downloaded and then deploy the model. Once deployed, the model can be used for semantic re-ranking with a `text_similarity_reranker` retriever.

{
    "service": "elasticsearch",
    "service_settings": {
        "model_id": ".rerank-v1", 
        "num_threads": 1,
        "adaptive_allocations": { 
        "enabled": true,
        "min_number_of_allocations": 1,
        "max_number_of_allocations": 4
        }
    }
}

Run `PUT _inference/text_embedding/my-e5-model` to create an inference endpoint that performs a `text_embedding` task. The `model_id` must be the ID of one of the built-in E5 models. The API will automatically download the E5 model if it isn't already downloaded and then deploy the model.

{
    "service": "elasticsearch",
    "service_settings": {
        "num_allocations": 1,
        "num_threads": 1,
        "model_id": ".multilingual-e5-small" 
    }
}

Run `PUT _inference/text_embedding/my-msmarco-minilm-model` to create an inference endpoint that performs a `text_embedding` task with a model that was uploaded by Eland.

{
    "service": "elasticsearch",
    "service_settings": {
        "num_allocations": 1,
        "num_threads": 1,
        "model_id": "msmarco-MiniLM-L12-cos-v5" 
    }
}

Run `PUT _inference/text_embedding/my-e5-model` to create an inference endpoint that performs a `text_embedding` task and to configure adaptive allocations. The API request will automatically download the E5 model if it isn't already downloaded and then deploy the model.

{
    "service": "elasticsearch",
    "service_settings": {
        "adaptive_allocations": {
        "enabled": true,
        "min_number_of_allocations": 3,
        "max_number_of_allocations": 10
        },
        "num_threads": 1,
        "model_id": ".multilingual-e5-small"
    }
}

Run `PUT _inference/sparse_embedding/use_existing_deployment` to use an already existing model deployment when creating an inference endpoint.

{
    "service": "elasticsearch",
    "service_settings": {
        "deployment_id": ".elser_model_2"
    }
}

Response examples (200)

A successful response from `PUT _inference/sparse_embedding/use_existing_deployment`. It contains the model ID and the threads and allocations settings from the model deployment.

{
  "inference_id": "use_existing_deployment",
  "task_type": "sparse_embedding",
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 2,
    "num_threads": 1,
    "model_id": ".elser_model_2",
    "deployment_id": ".elser_model_2"
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}