You can evaluate the performance of foundation models and your tuned generative AI models on Vertex AI. The models are evaluated using a set of metrics against an evaluation dataset that you provide. This page explains how computation-based model evaluation through the evaluation pipeline service works, how to create and format the evaluation dataset, and how to perform the evaluation using the Google Cloud console, Vertex AI API, or the Vertex AI SDK for Python.
How computation-based model evaluation works
To evaluate the performance of a model, you first create an evaluation dataset that contains prompt and ground truth pairs. For each pair, the prompt is the input that you want to evaluate, and the ground truth is the ideal response for that prompt. During evaluation, the prompt in each pair of the evaluation dataset is passed to the model to produce an output. The output generated by the model and the ground truth from the evaluation dataset are used to compute the evaluation metrics.
The type of metrics used for evaluation depends on the task that you are evaluating. The following table shows the supported tasks and the metrics used to evaluate each task:
Task | Metric |
---|---|
Classification | Micro-F1, Macro-F1, Per class F1 |
Summarization | ROUGE-L |
Question answering | Exact Match |
Text generation | BLEU, ROUGE-L |
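To make the table concrete, here is a minimal sketch of three of the listed metrics computed from model outputs and ground truths. It is not the evaluation pipeline's implementation: the function names are illustrative, and the whitespace trimming, whitespace tokenization, and single-label classification setup are assumptions, since this page does not specify how the service normalizes text.

```python
from collections import Counter


def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace.

    Expects non-empty, equally long lists.
    """
    pairs = list(zip(predictions, references))
    return sum(p.strip() == r.strip() for p, r in pairs) / len(pairs)


def f1_scores(predictions, references):
    """Return (per_class_f1, micro_f1, macro_f1) for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, ref in zip(predictions, references):
        if pred == ref:
            tp[ref] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[ref] += 1   # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Assumption: macro-F1 averages over every class seen in either list.
    classes = sorted(set(references) | set(predictions))
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(per_class.values()) / len(per_class)
    return per_class, micro, macro


def rouge_l_f1(prediction, reference):
    """ROUGE-L F-measure from the longest common subsequence of whitespace tokens."""
    pred, ref = prediction.split(), reference.split()
    # Dynamic-programming table holding LCS lengths of all prefixes.
    table = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p_tok in enumerate(pred, 1):
        for j, r_tok in enumerate(ref, 1):
            if p_tok == r_tok:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    lcs = table[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Micro-F1 pools the counts across classes before computing F1, while Macro-F1 averages the per-class scores, so the two diverge when classes are imbalanced.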
Supported models
Model evaluation is supported for the following models:
- text-bison: Base and tuned versions.
- Gemini: All tasks except classification.
Prepare evaluation dataset
The evaluation dataset that's used for model evaluation includes prompt and ground truth pairs that align with the task that you want to evaluate. Your dataset must include at least one prompt and ground truth pair. For meaningful metrics, include at least 10 pairs. The more examples you provide, the more meaningful the results.
Dataset format
Your evaluation dataset must be in JSON Lines (JSONL) format, where each line contains a single prompt and ground truth pair specified in the input_text and output_text fields, respectively. The input_text field contains the prompt that you want to evaluate, and the output_text field contains the ideal response for the prompt.
The maximum token length for input_text is 8,192, and the maximum token length for output_text is 1,024.
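As an illustration of this format, the following sketch writes prompt and ground truth pairs to a JSONL file and checks that every line carries both fields. The helper names are hypothetical, and the token-length limits are not checked here because that would require the model's tokenizer.

```python
import json


def write_eval_dataset(pairs, path):
    """Write (prompt, ground_truth) pairs as JSONL with input_text/output_text fields."""
    if not pairs:
        raise ValueError("The dataset needs at least one prompt and ground truth pair.")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, ground_truth in pairs:
            record = {"input_text": prompt, "output_text": ground_truth}
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_eval_dataset(path):
    """Return the number of valid records; raise ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, 1):
            if not line.strip():
                continue  # Ignore blank lines.
            record = json.loads(line)  # JSONDecodeError is a subclass of ValueError.
            if not isinstance(record, dict):
                raise ValueError(f"Line {line_number}: expected a JSON object")
            for field in ("input_text", "output_text"):
                if not isinstance(record.get(field), str):
                    raise ValueError(f"Line {line_number}: missing string field {field!r}")
            count += 1
    return count
```

For example, `write_eval_dataset([("What is the capital of France?", "Paris")], "dataset.jsonl")` produces a one-line file that `validate_eval_dataset("dataset.jsonl")` counts as one record.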
Upload evaluation dataset to Cloud Storage
You can either create a new Cloud Storage bucket or use an existing one to store your dataset file. The bucket must be in the same region as the model.
After your bucket is ready, upload your dataset file to the bucket.
Perform model evaluation
You can evaluate models by using the REST API, the Vertex AI SDK for Python, or the Google Cloud console.
REST
To create a model evaluation job, send a POST request by using the pipelineJobs method.
Before using any of the request data, make the following replacements:
- PROJECT_ID: The Google Cloud project that runs the pipeline components.
- PIPELINEJOB_DISPLAYNAME: A display name for the pipelineJob.
- LOCATION: The region to run the pipeline components. Currently, only us-central1 is supported.
- DATASET_URI: The Cloud Storage URI of your reference dataset. You can specify one or multiple URIs. This parameter supports wildcards. To learn more about this parameter, see InputConfig.
- OUTPUT_DIR: The Cloud Storage URI to store evaluation output.
- MODEL_NAME: Specify a publisher model or a tuned model resource as follows:
  - Publisher model: publishers/google/models/MODEL@MODEL_VERSION
    Example: publishers/google/models/text-bison@002
  - Tuned model: projects/PROJECT_NUMBER/locations/LOCATION/models/ENDPOINT_ID
    Example: projects/123456789012/locations/us-central1/models/1234567890123456789
  The evaluation job doesn't impact any existing deployments of the model or their resources.
- EVALUATION_TASK: The task that you want to evaluate the model on. The evaluation job computes a set of metrics relevant to that specific task. Acceptable values include the following:
  - summarization
  - question-answering
  - text-generation
  - classification
- INSTANCES_FORMAT: The format of your dataset. Currently, only jsonl is supported. To learn more about this parameter, see InputConfig.
- PREDICTIONS_FORMAT: The format of the evaluation output. Currently, only jsonl is supported. To learn more about this parameter, see InputConfig.
- MACHINE_TYPE: (Optional) The machine type for running the evaluation job. The default value is e2-highmem-16. For a list of supported machine types, see Machine types.
- SERVICE_ACCOUNT: (Optional) The service account to use for running the evaluation job. To learn how to create a custom service account, see Configure a service account with granular permissions. If unspecified, the Vertex AI Custom Code Service Agent is used.
- NETWORK: (Optional) The fully qualified name of the Compute Engine network to peer the evaluation job to. The format of the network name is projects/PROJECT_NUMBER/global/networks/NETWORK_NAME. If you specify this field, you need to have a VPC Network Peering for Vertex AI. If left unspecified, the evaluation job is not peered with any network.
- KEY_NAME: (Optional) The name of the customer-managed encryption key (CMEK). If configured, resources created by the evaluation job are encrypted using the provided encryption key. The format of the key name is projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING/cryptoKeys/KEY. The key needs to be in the same region as the evaluation job.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs
Request JSON body:
{
  "displayName": "PIPELINEJOB_DISPLAYNAME",
  "runtimeConfig": {
    "gcsOutputDirectory": "gs://OUTPUT_DIR",
    "parameterValues": {
      "project": "PROJECT_ID",
      "location": "LOCATION",
      "batch_predict_gcs_source_uris": ["gs://DATASET_URI"],
      "batch_predict_gcs_destination_output_uri": "gs://OUTPUT_DIR",
      "model_name": "MODEL_NAME",
      "evaluation_task": "EVALUATION_TASK",
      "batch_predict_instances_format": "INSTANCES_FORMAT",
      "batch_predict_predictions_format": "PREDICTIONS_FORMAT",
      "machine_type": "MACHINE_TYPE",
      "service_account": "SERVICE_ACCOUNT",
      "network": "NETWORK",
      "encryption_spec_key_name": "KEY_NAME"
    }
  },
  "templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"
}
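If you generate the request body programmatically, the placeholder list maps onto it mechanically. The following is a minimal sketch, not an official client: the function name and argument names are hypothetical, the parameter keys are the ones shown in the request body, the URIs are assumed to be passed as full gs:// URIs, and unset optional parameters are simply omitted (as in the example curl command on this page).

```python
import json

# Template URI as used in the request body on this page.
TEMPLATE_URI = (
    "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/"
    "evaluation-llm-text-generation-pipeline/1.0.1"
)
SUPPORTED_TASKS = {"summarization", "question-answering", "text-generation", "classification"}


def build_request_body(display_name, project_id, location, dataset_uris,
                       output_dir, model_name, evaluation_task, **optional):
    """Assemble a pipelineJobs request body; optional values that are None are omitted."""
    if evaluation_task not in SUPPORTED_TASKS:
        raise ValueError(f"Unsupported evaluation task: {evaluation_task!r}")
    parameters = {
        "project": project_id,
        "location": location,
        "batch_predict_gcs_source_uris": list(dataset_uris),
        "batch_predict_gcs_destination_output_uri": output_dir,
        "model_name": model_name,
        "evaluation_task": evaluation_task,
        "batch_predict_instances_format": "jsonl",    # only supported value
        "batch_predict_predictions_format": "jsonl",  # only supported value
    }
    # Optional keys: machine_type, service_account, network, encryption_spec_key_name.
    parameters.update({k: v for k, v in optional.items() if v is not None})
    return {
        "displayName": display_name,
        "runtimeConfig": {
            "gcsOutputDirectory": output_dir,
            "parameterValues": parameters,
        },
        "templateUri": TEMPLATE_URI,
    }


if __name__ == "__main__":
    body = build_request_body(
        "my-evaluation-job", "myproject", "us-central1",
        ["gs://my-gcs-bucket-uri/dataset.jsonl"], "gs://my-gcs-bucket-uri/output",
        "publishers/google/models/text-bison@002", "summarization",
        machine_type="e2-highmem-16",
    )
    print(json.dumps(body, indent=2))
```

Redirecting the printed output to request.json gives a file that can be sent with either command in this section.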
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs" | Select-Object -Expand Content
You should receive a JSON response similar to the following. Note that pipelineSpec has been truncated to save space.
Example curl command
PROJECT_ID=myproject
REGION=us-central1
MODEL_NAME=publishers/google/models/text-bison@002
TEST_DATASET_URI=gs://my-gcs-bucket-uri/dataset.jsonl
OUTPUT_DIR=gs://my-gcs-bucket-uri/output
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
"https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/pipelineJobs" -d \
$'{
  "displayName": "evaluation-llm-text-generation-pipeline",
  "runtimeConfig": {
    "gcsOutputDirectory": "'${OUTPUT_DIR}'",
    "parameterValues": {
      "project": "'${PROJECT_ID}'",
      "location": "'${REGION}'",
      "batch_predict_gcs_source_uris": ["'${TEST_DATASET_URI}'"],
      "batch_predict_gcs_destination_output_uri": "'${OUTPUT_DIR}'",
      "model_name": "'${MODEL_NAME}'"
    }
  },
  "templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"
}'
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
Console
To create a model evaluation job by using the Google Cloud console, perform the following steps:
- In the Google Cloud console, go to the Vertex AI Model Registry page.
- Click the name of the model that you want to evaluate.
- In the Evaluate tab, click Create evaluation and configure as follows:
  - Objective: Select the task that you want to evaluate.
  - Target column or field: (Classification only) Enter the target column for prediction. Example: ground_truth.
  - Source path: Enter or select the URI of your evaluation dataset.
  - Output format: Enter the format of the evaluation output. Currently, only jsonl is supported.
  - Cloud Storage path: Enter or select the URI to store evaluation output.
  - Class names: (Classification only) Enter the list of possible class names.
  - Number of compute nodes: Enter the number of compute nodes to run the evaluation job.
  - Machine type: Select a machine type to use for running the evaluation job.
- Click Start evaluation.
View evaluation results
You can find the evaluation results in the Cloud Storage output directory that you specified when creating the evaluation job. The file is named evaluation_metrics.json.
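After you download evaluation_metrics.json from the output directory, a short script can list the numeric values it contains. The sketch below makes no assumption about the file's schema, which this page does not describe. It walks whatever JSON structure it finds, and the function names are illustrative.

```python
import json


def flatten_metrics(node, prefix=""):
    """Yield (dotted_name, value) for every numeric leaf in a nested JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten_metrics(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten_metrics(value, f"{prefix}{index}.")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield prefix[:-1], node  # drop the trailing dot


def load_metrics(path):
    """Read a local copy of the metrics file and return {dotted_name: number}."""
    with open(path, encoding="utf-8") as f:
        return dict(flatten_metrics(json.load(f)))
```

Calling `load_metrics("evaluation_metrics.json")` on a local copy returns a flat dictionary that is easy to print or compare across evaluation runs.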
For tuned models, you can also view evaluation results in the Google Cloud console:
- In the Vertex AI section of the Google Cloud console, go to the Vertex AI Model Registry page.
- Click the name of the model to view its evaluation metrics.
- In the Evaluate tab, click the name of the evaluation run that you want to view.
What's next
- Learn about generative AI evaluation.
- Learn about online evaluation with Gen AI Evaluation Service.
- Learn how to tune a foundation model.