Create a Llama inference endpoint
Generally available; Added in 9.2.0
Path parameters
- `task_type`: The type of the inference task that the model will perform. Values are `text_embedding`, `completion`, or `chat_completion`.
- `llama_inference_id`: The unique identifier of the inference endpoint.
Query parameters
- `timeout`: Specifies the amount of time to wait for the inference endpoint to be created. Values are `-1` or `0`.
PUT
/_inference/{task_type}/{llama_inference_id}
Console
PUT _inference/text_embedding/llama-text-embedding
{
  "service": "llama",
  "service_settings": {
    "url": "http://localhost:8321/v1/inference/embeddings",
    "dimensions": 384,
    "model_id": "all-MiniLM-L6-v2"
  }
}
curl \
--request PUT 'http://api.example.com/_inference/{task_type}/{llama_inference_id}' \
--header "Authorization: $API_KEY" \
--header "Content-Type: application/json" \
--data '{
  "service": "llama",
  "service_settings": {
    "url": "http://localhost:8321/v1/inference/embeddings",
    "dimensions": 384,
    "model_id": "all-MiniLM-L6-v2"
  }
}'
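The same request can be assembled programmatically. The sketch below builds the request path and body for this endpoint using only the standard library; the helper name `build_llama_endpoint` is illustrative and not part of any official client.

```python
import json

def build_llama_endpoint(task_type, inference_id, url, model_id, dimensions=None):
    """Build the path and body for a PUT _inference request using the
    llama service. Helper name and signature are illustrative only."""
    body = {
        "service": "llama",
        "service_settings": {
            "url": url,
            "model_id": model_id,
        },
    }
    # dimensions applies to text_embedding endpoints only
    if dimensions is not None:
        body["service_settings"]["dimensions"] = dimensions
    path = f"/_inference/{task_type}/{inference_id}"
    return path, body

path, body = build_llama_endpoint(
    "text_embedding",
    "llama-text-embedding",
    "http://localhost:8321/v1/inference/embeddings",
    "all-MiniLM-L6-v2",
    dimensions=384,
)
print(path)
print(json.dumps(body, indent=2))
```

Send the resulting body with any HTTP client as a `PUT` to `path` on your cluster, with the same `Authorization` and `Content-Type` headers as the curl example above.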
Request examples
Put Llama request example 1
Run `PUT _inference/text_embedding/llama-text-embedding` to create a Llama inference endpoint that performs a `text_embedding` task.
{
  "service": "llama",
  "service_settings": {
    "url": "http://localhost:8321/v1/inference/embeddings",
    "dimensions": 384,
    "model_id": "all-MiniLM-L6-v2"
  }
}
Run `PUT _inference/completion/llama-completion` to create a Llama inference endpoint that performs a `completion` task.
{
  "service": "llama",
  "service_settings": {
    "url": "http://localhost:8321/v1/openai/v1/chat/completions",
    "model_id": "llama3.2:3b"
  }
}
Run `PUT _inference/chat-completion/llama-chat-completion` to create a Llama inference endpoint that performs a `chat_completion` task.
{
  "service": "llama",
  "service_settings": {
    "url": "http://localhost:8321/v1/openai/v1/chat/completions",
    "model_id": "llama3.2:3b"
  }
}
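As a rough end-to-end sketch, the last example can be sent from Python with the standard library. The cluster address and API key below are placeholders you would replace with your own; the request path mirrors the example above.

```python
import json
import urllib.request

# Placeholders: point these at your own cluster and credentials.
ES_URL = "http://localhost:9200"
API_KEY = "YOUR_API_KEY"

# Body from the chat_completion example above.
body = {
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "model_id": "llama3.2:3b",
    },
}

req = urllib.request.Request(
    f"{ES_URL}/_inference/chat-completion/llama-chat-completion",
    data=json.dumps(body).encode("utf-8"),
    method="PUT",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"ApiKey {API_KEY}",
    },
)

try:
    with urllib.request.urlopen(req) as resp:
        print("created:", resp.status)
except OSError as exc:
    # No cluster reachable at ES_URL (or auth failed); nothing was created.
    print("request not sent:", exc)
```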