This page provides a foundation for learning how to accelerate machine learning (ML) workloads using TPUs in Google Kubernetes Engine (GKE). TPUs are designed for matrix multiplication processing, such as large-scale deep learning model training. Because TPUs are optimized to handle the enormous datasets and complex models of ML, they are more cost-effective and energy efficient for ML workloads. In this guide, you learn how to deploy ML workloads by using Cloud TPU accelerators, configure quotas for TPUs, configure upgrades for node pools that run TPUs, and monitor TPU workload metrics.
This tutorial is intended for Machine learning (ML) engineers and Platform admins and operators who are interested in using Kubernetes container orchestration to manage large-scale model training, tuning, and inference workloads using TPUs. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before reading this page, ensure that you're familiar with the following:
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Plan your TPU configuration
Plan your TPU configuration based on your model and how much memory it requires. Before you use this guide to deploy your workloads on TPU, complete the planning steps in Plan your TPU configuration.
Ensure that you have TPU quota
The following sections help you ensure that you have enough quota when using TPUs in GKE.
Quota for on-demand or Spot VMs
If you are creating a TPU slice node pool with on-demand or Spot VMs, you must have sufficient TPU quota available in the region that you want to use.
Creating a TPU slice node pool that consumes a TPU reservation does not require any TPU quota¹, so you can safely skip this section for reserved TPUs.
Creating an on-demand or Spot TPU slice node pool in GKE requires Compute Engine API quota. Compute Engine API quota (compute.googleapis.com) is not the same as Cloud TPU API quota (tpu.googleapis.com), which is needed when creating TPUs with the Cloud TPU API.
To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:
Go to the Quotas page in the Google Cloud console:
In the Filter box, do the following:
- Select the Service property, enter Compute Engine API, and press Enter.
- Select the Type property and choose Quota.
- Select the Name property and enter the name of the quota based on the TPU version and machine type. For example, if you plan to create on-demand TPU v5e nodes whose machine type begins with ct5lp-, enter TPU v5 Lite PodSlice chips.
TPU version | Machine type begins with | Name of the quota for on-demand instances | Name of the quota for Spot² instances
--- | --- | --- | ---
TPU v3 | ct3- | TPU v3 Device chips | Preemptible TPU v3 Device chips
TPU v3 | ct3p- | TPU v3 PodSlice chips | Preemptible TPU v3 PodSlice chips
TPU v4 | ct4p- | TPU v4 PodSlice chips | Preemptible TPU v4 PodSlice chips
TPU v5e | ct5l- | TPU v5 Lite Device chips | Preemptible TPU v5 Lite Device chips
TPU v5e | ct5lp- | TPU v5 Lite PodSlice chips | Preemptible TPU v5 Lite PodSlice chips
TPU v5p | ct5p- | TPU v5p chips | Preemptible TPU v5p chips
TPU Trillium | ct6e- | TPU v6e Slice chips | Preemptible TPU v6e Lite PodSlice chips
- Select the Dimensions (e.g. locations) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU slice nodes in the zone us-west4-a. TPU quota is regional, so all zones within the same region consume the same TPU quota.
If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the region that you need, and you must request a TPU quota increase.
When a TPU reservation is created, both the limit and current use values for the corresponding quota increase by the number of chips in the TPU reservation. For example, when a reservation is created for 16 TPU v5e chips whose machine type begins with ct5lp-, then both the Limit and Current usage for the TPU v5 Lite PodSlice chips quota in the relevant region increase by 16.
1. When creating a TPU slice node pool, use the --reservation and --reservation-affinity=specific flags to create a reserved instance. TPU reservations are available when purchasing a commitment.
2. When creating a TPU slice node pool, use the --spot flag to create a Spot instance.
Quotas for additional GKE resources
You may need to increase the following GKE-related quotas in the regions where GKE creates your resources.
- Persistent Disk SSD (GB) quota: The boot disk of each Kubernetes node requires 100 GB by default. Therefore, set this quota at least as high as the maximum number of GKE nodes that you anticipate creating multiplied by 100 GB (nodes * 100 GB).
- In-use IP addresses quota: Each Kubernetes node consumes one IP address. Therefore, set this quota at least as high as the maximum number of GKE nodes that you anticipate creating.
- Ensure that max-pods-per-node aligns with the subnet range: Each Kubernetes node uses secondary IP ranges for Pods. For example, a max-pods-per-node value of 32 requires 64 IP addresses, which translates to a /26 subnet per node. This range shouldn't be shared with any other cluster. To avoid exhausting the IP address range, use the --max-pods-per-node flag to limit the number of Pods allowed to be scheduled on a node. Size the Pod secondary range for the maximum number of GKE nodes that you anticipate creating, as shown in the sketch after this list.
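As a rough sizing sketch (not an official formula; MAX_NODES and MAX_PODS_PER_NODE are hypothetical planning values that you should replace with your own numbers, and 100 GB is the default boot disk size mentioned above):

# Hypothetical planning values; adjust for your environment.
MAX_NODES=16               # maximum number of GKE nodes you anticipate creating
MAX_PODS_PER_NODE=32       # value passed to --max-pods-per-node

# Persistent Disk SSD (GB) quota: 100 GB boot disk per node by default.
echo "Persistent Disk SSD (GB) needed: $(( MAX_NODES * 100 ))"

# In-use IP addresses quota: one address per node.
echo "In-use IP addresses needed: ${MAX_NODES}"

# Pod secondary range: GKE reserves twice max-pods-per-node addresses per node.
echo "Pod IP addresses needed: $(( MAX_NODES * MAX_PODS_PER_NODE * 2 ))"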
To request an increase in quota, see Request higher quota.
Ensure reservation availability
Creating a reserved TPU slice node pool, which consumes a reservation, does not require any TPU quota. However, the reservation must have enough available or unused TPU chips when the node pool is created.
To see which reservations exist within a project, view a list of your reservations.
To view how many TPU chips within a TPU reservation are available, view the details of a reservation.
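For example, you can inspect reservations with the gcloud CLI; this is a sketch with placeholder values for the project, reservation name, and zone:

# List the Compute Engine reservations in your project.
gcloud compute reservations list --project=PROJECT_ID

# Show the details of one reservation, including how much of it is in use.
gcloud compute reservations describe RESERVATION_NAME \
    --zone=ZONE \
    --project=PROJECT_ID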
Options for provisioning TPUs in GKE
GKE lets you use TPUs directly in individual workloads by using Kubernetes nodeSelectors in your workload manifest or by creating Standard mode node pools with TPUs.
Alternatively, you can request TPUs by using custom compute classes. Custom compute classes let platform administrators define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware.
For instructions, see the Provision TPUs using custom compute classes section.
Create a cluster
Create a GKE cluster in Standard mode in a region with available TPUs.
Use regional clusters, which provide high availability of the Kubernetes control plane.
gcloud container clusters create CLUSTER_NAME \
--location LOCATION \
--cluster-version VERSION
Replace the following:
- CLUSTER_NAME: the name of the new cluster.
- LOCATION: the region with your TPU capacity available.
- VERSION: the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not have availability for your target TPU. To learn the minimum GKE versions available by TPU machine type, see TPU availability in GKE.
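For example, the following sketch uses placeholder values; us-west4 is only an assumption of a region with TPU capacity, and the version shown is one referenced elsewhere on this page (confirm that it supports your machine type):

gcloud container clusters create tpu-demo-cluster \
    --location us-west4 \
    --cluster-version 1.29.1-gke.1425000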
Create a node pool
Single-host TPU slice
You can create a single-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.
gcloud
gcloud container node-pools create NODE_POOL_NAME \
--location=LOCATION \
--cluster=CLUSTER_NAME \
--node-locations=NODE_ZONES \
--machine-type=MACHINE_TYPE
Replace the following:
- NODE_POOL_NAME: The name of the new node pool.
- LOCATION: The name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE.
- CLUSTER_NAME: The name of the cluster.
- NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
- MACHINE_TYPE: The type of machine to use for nodes. For more information about TPU-compatible machine types, see the table in Choose the TPU version.
Optionally, you can also use the following flags:
- --num-nodes=NUM_NODES: The initial number of nodes in the node pool in each zone. If you omit this flag, GKE assigns the default of 3.
  Best practice: If you use the enable-autoscaling flag for the node pool, set num-nodes to 0 so that the autoscaler provisions additional nodes as soon as your workloads demand them.
- --reservation=RESERVATION_NAME: The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPUs. To learn more about TPU reservations, see TPU reservation.
- --node-labels cloud.google.com/gke-workload-type=HIGH_AVAILABILITY: Tells GKE that the single-host TPU slice node pool is part of a collection. Use this flag if the following conditions apply:
  - The node pool runs an inference workload.
  - The node pool uses TPU Trillium.
  - The node pool doesn't use Spot VMs.
  To learn more about collection scheduling management, see Manage collection scheduling in single-host TPU slices.
- --enable-autoscaling: Create a node pool with autoscaling enabled. Requires the following additional flags:
  - --total-min-nodes=TOTAL_MIN_NODES: Minimum number of all nodes in the node pool.
  - --total-max-nodes=TOTAL_MAX_NODES: Maximum number of all nodes in the node pool.
  - --location-policy=ANY: Prioritizes the use of unused reservations and reduces the preemption risk of Spot VMs.
- --spot: Sets the node pool to use Spot VMs for the nodes in the node pool. This cannot be changed after node pool creation.
For a full list of all the flags that you can specify, see the gcloud container node-pools create reference.
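For example, the following sketch creates a single-host TPU v5e slice node pool that autoscales from zero; the names are placeholders, and ct5lp-hightpu-4t and us-west4-a are assumptions of a single-host v5e machine type and a zone that offers it (check TPU availability in GKE for your own values):

gcloud container node-pools create tpu-v5e-pool \
    --location=us-west4 \
    --cluster=tpu-demo-cluster \
    --node-locations=us-west4-a \
    --machine-type=ct5lp-hightpu-4t \
    --num-nodes=0 \
    --enable-autoscaling \
    --total-min-nodes=0 \
    --total-max-nodes=4 \
    --location-policy=ANY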
Terraform
- Ensure that you use version 4.84.0 or later of the google provider.
- Add the following block to your Terraform configuration:
resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
provider = google
project = PROJECT_ID
cluster = CLUSTER_NAME
name = POOL_NAME
location = CLUSTER_LOCATION
node_locations = [NODE_ZONES]
node_config {
machine_type = MACHINE_TYPE
reservation_affinity {
consume_reservation_type = "SPECIFIC_RESERVATION"
key = "compute.googleapis.com/reservation-name"
values = [RESERVATION_LABEL_VALUES]
}
spot = true
}
}
Replace the following:
- NODE_POOL_RESOURCE_NAME: The name of the node pool resource in the Terraform template.
- PROJECT_ID: Your project ID.
- CLUSTER_NAME: The name of the existing cluster.
- POOL_NAME: The name of the node pool to create.
- CLUSTER_LOCATION: The compute zone(s) of the cluster. Specify the region where the TPU version is available. To learn more, see Select a TPU version and topology.
- NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
- MACHINE_TYPE: The type of TPU machine to use. To see TPU-compatible machine types, use the table in Choose the TPU version.
Optionally, you can also use the following variables:
- autoscaling: Create a node pool with autoscaling enabled. For a single-host TPU slice, GKE scales between the TOTAL_MIN_NODES and TOTAL_MAX_NODES values.
  - TOTAL_MIN_NODES: Minimum number of all nodes in the node pool. This field is optional unless autoscaling is also specified.
  - TOTAL_MAX_NODES: Maximum number of all nodes in the node pool. This field is optional unless autoscaling is also specified.
- RESERVATION_NAME: If you use TPU reservation, this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate the RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider.
- spot: Sets the node pool to use Spot VMs for the TPU nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.
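After you fill in the block, a minimal workflow sketch using standard Terraform commands (the plan file name is arbitrary):

# Download the google provider and initialize the working directory.
terraform init

# Preview the node pool that Terraform would create.
terraform plan -out=tpu-nodepool.tfplan

# Apply the plan to create the node pool.
terraform apply tpu-nodepool.tfplan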
Console
To create a node pool with TPUs:
Go to the Google Kubernetes Engine page in the Google Cloud console.
In the cluster list, click the name of the cluster you want to modify.
Click add_box Add node pool.
In the Node pool details section, check the Specify node locations box.
Select the zone based on the TPU version you want to use. To identify an available zone, see TPU availability in GKE.
From the navigation pane, click Nodes.
In the Machine Configuration section, select TPUs.
In the Series drop-down menu, select one of the following:
- CT3: TPU v3, single host device
- CT3P: TPU v3, multi host pod slice
- CT4P: TPU v4
- CT5LP: TPU v5e
- CT5P: TPU v5p
- CT6E: TPU Trillium (v6e)
In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Choose the TPU version table to learn how to define the machine type and TPU topology that create a single-host TPU slice node pool.
In the TPU Topology drop-down menu, select the physical topology for the TPU slice.
In the Changes needed dialog, click Make changes.
Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.
Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.
Click Create.
Multi-host TPU slice
You can create a multi-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.
gcloud
gcloud container node-pools create POOL_NAME \
--location=LOCATION \
--cluster=CLUSTER_NAME \
--node-locations=NODE_ZONE \
--machine-type=MACHINE_TYPE \
--tpu-topology=TPU_TOPOLOGY \
--num-nodes=NUM_NODES \
[--spot \]
[--enable-autoscaling \
--max-nodes MAX_NODES]
[--reservation-affinity=specific \
--reservation=RESERVATION_NAME]
Replace the following:
- POOL_NAME: The name of the new node pool.
- LOCATION: The name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE.
- CLUSTER_NAME: The name of the cluster.
- NODE_ZONE: The comma-separated list of one or more zones where GKE creates the node pool.
- MACHINE_TYPE: The type of machine to use for nodes. To learn more about the available machine types, see Choose the TPU version.
- TPU_TOPOLOGY: The physical topology for the TPU slice. The format of the topology depends on the TPU version. To learn more about TPU topologies, use the table in Choose a topology. To learn more, see Topology.
- NUM_NODES: The number of nodes in the node pool. It must be zero or the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM. For multi-host TPU v4 and TPU v5e, the number of chips in each VM is four. Therefore, if your TPU_TOPOLOGY is 2x4x4 (TPU v4 with four chips in each VM), then NUM_NODES is 32/4, which equals 8.
Optionally, you can also use the following flags:
- RESERVATION_NAME: The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPU slice node pools. To learn more about TPU reservations, see TPU reservation.
- --spot: Sets the node pool to use Spot VMs for the TPU slice nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.
- --enable-autoscaling: Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
  - MAX_NODES: The maximum size of the node pool. The --max-nodes flag is required if --enable-autoscaling is supplied and must be equal to the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM.
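For example, a 2x2x4 TPU v4 topology has 2*2*4 = 16 chips; with four chips per VM, that is 16/4 = 4 nodes. The following sketch uses placeholder names, and ct4p-hightpu-4t and us-central2-b are assumptions of a TPU v4 machine type and a zone that offers it:

# 2x2x4 topology = 16 chips; 16 chips / 4 chips per VM = 4 nodes.
gcloud container node-pools create tpu-v4-multihost-pool \
    --location=us-central2 \
    --cluster=tpu-demo-cluster \
    --node-locations=us-central2-b \
    --machine-type=ct4p-hightpu-4t \
    --tpu-topology=2x2x4 \
    --num-nodes=4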
Terraform
- Ensure that you use version 4.84.0 or later of the google provider.
- Add the following block to your Terraform configuration:

resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
  provider           = google
  project            = PROJECT_ID
  cluster            = CLUSTER_NAME
  name               = POOL_NAME
  location           = CLUSTER_LOCATION
  node_locations     = [NODE_ZONES]
  initial_node_count = NUM_NODES

  autoscaling {
    max_node_count  = MAX_NODES
    location_policy = "ANY"
  }

  node_config {
    machine_type = MACHINE_TYPE
    reservation_affinity {
      consume_reservation_type = "SPECIFIC_RESERVATION"
      key                      = "compute.googleapis.com/reservation-name"
      values                   = [RESERVATION_LABEL_VALUES]
    }
    spot = true
  }

  placement_policy {
    type         = "COMPACT"
    tpu_topology = TPU_TOPOLOGY
  }
}
Replace the following:
- NODE_POOL_RESOURCE_NAME: The name of the node pool resource in the Terraform template.
- PROJECT_ID: Your project ID.
- CLUSTER_NAME: The name of the existing cluster to add the node pool to.
- POOL_NAME: The name of the node pool to create.
- CLUSTER_LOCATION: Compute location for the cluster. We recommend having a regional cluster for higher reliability of the Kubernetes control plane. You can also use a zonal cluster. To learn more, see Select a TPU version and topology.
- NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
- NUM_NODES: The number of nodes in the node pool. It must be zero or the product of the number of the TPU chips divided by four, because in multi-host TPU slices each TPU slice node has 4 chips. For example, if TPU_TOPOLOGY is 4x8, then there are 32 chips, which means NUM_NODES must be 8. To learn more about TPU topologies, use the table in Choose the TPU version.
- TPU_TOPOLOGY: This indicates the desired physical topology for the TPU slice. The format of the topology depends on the TPU version you are using. To learn more about TPU topologies, use the table in Choose a topology.
Optionally, you can also use the following variables:
- RESERVATION_NAME: If you use TPU reservation, this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate the RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider.
- autoscaling: Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
  - MAX_NODES: The maximum size of the node pool. It must be equal to the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM.
- spot: Lets the node pool use Spot VMs for the TPU slice nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.
Console
To create a node pool with TPUs:
Go to the Google Kubernetes Engine page in the Google Cloud console.
In the cluster list, click the name of the cluster you want to modify.
Click add_box Add node pool.
In the Node pool details section, check the Specify node locations box.
Select the name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE.
From the navigation pane, click Nodes.
In the Machine Configuration section, select TPUs.
In the Series drop-down menu, select one of the following:
- CT3P: For TPU v3.
- CT4P: For TPU v4.
- CT5LP: For TPU v5e.
In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Choose the TPU version table to learn how to define the machine type and TPU topology that create a multi-host TPU slice node pool.
In the TPU Topology drop-down menu, select the physical topology for the TPU slice.
In the Changes needed dialog, click Make changes.
Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.
Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.
Click Create.
Provisioning state
If GKE cannot create your TPU slice node pool due to insufficient TPU capacity available, GKE returns an error message indicating the TPU slice nodes cannot be created due to lack of capacity.
If you are creating a single-host TPU slice node pool, the error message looks similar to this:
2 nodes cannot be created due to lack of capacity. The missing nodes will be
created asynchronously once capacity is available. You can either wait for the
nodes to be up, or delete the node pool and try re-creating it again later.
If you are creating a multi-host TPU slice node pool, the error message looks similar to this:
The nodes (managed by ...) cannot be created now due to lack of capacity. They
will be created asynchronously once capacity is available. You can either wait
for the nodes to be up, or delete the node pool and try re-creating it again
later.
Your TPU provisioning request can stay in the queue for a long time and remains in the "Provisioning" state while in the queue.
Once capacity is available, GKE creates the remaining nodes that were not created.
If you need capacity sooner, consider trying Spot VMs, though note that Spot VMs consume different quota than on-demand instances.
You can delete the queued TPU request by deleting the TPU slice node pool.
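To check where a node pool stands, one option is to read its status field with a sketch like the following (placeholder names; the node pools API reports values such as PROVISIONING and RUNNING):

# Show the current status of the TPU slice node pool.
gcloud container node-pools describe POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --format="value(status)"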
Run your workload on TPU slice nodes
Workload preparation
TPU workloads have the following preparation requirements.
- Frameworks like JAX, PyTorch, and TensorFlow access TPU VMs using the libtpu shared library. libtpu includes the XLA compiler, TPU runtime software, and the TPU driver. Each release of PyTorch and JAX requires a certain libtpu.so version. To use TPUs in GKE, ensure that you use the following versions:

TPU type | libtpu.so version
--- | ---
TPU Trillium (v6e) (tpu-v6e-slice) | Recommended jax[tpu] version: v0.4.9 or later. Recommended torchxla[tpuvm] version: v2.1.0 or later.
TPU v5e (tpu-v5-lite-podslice, tpu-v5-lite-device) | Recommended jax[tpu] version: v0.4.9 or later. Recommended torchxla[tpuvm] version: v2.1.0 or later.
TPU v5p (tpu-v5p-slice) | Recommended jax[tpu] version: 0.4.19 or later. Recommended torchxla[tpuvm] version: a nightly build from October 23, 2023.
TPU v4 (tpu-v4-podslice) | Recommended jax[tpu] version: v0.4.4 or later. Recommended torchxla[tpuvm] version: v2.0.0 or later.
TPU v3 (tpu-v3-slice, tpu-v3-device) | Recommended jax[tpu] version: v0.4.4 or later. Recommended torchxla[tpuvm] version: v2.0.0 or later.
- Set the following environment variables for the container requesting the TPU resources:
  - TPU_WORKER_ID: A unique integer for each Pod. This ID denotes a unique worker ID in the TPU slice. The supported values for this field range from zero to the number of Pods minus one.
  - TPU_WORKER_HOSTNAMES: A comma-separated list of TPU VM hostnames or IP addresses that need to communicate with each other within the slice. There should be a hostname or IP address for each TPU VM in the slice. The list of IP addresses or hostnames is ordered and zero-indexed by the TPU_WORKER_ID.
GKE automatically injects these environment variables by using a mutating webhook when a Job is created with the completionMode: Indexed, subdomain, and parallelism > 1 properties and requests google.com/tpu resources. GKE adds a headless Service so that the DNS records are added for the Pods backing the Service.
When deploying TPU multi-host resources with Kuberay, GKE provides a deployable webhook as part of the Terraform templates for running Ray on GKE. Instructions for running Ray on GKE with TPUs can be found in the TPU User Guide. The mutating webhook injects these environment variables into Ray clusters requesting google.com/tpu resources and a multi-host cloud.google.com/gke-tpu-topology node selector.
- In your workload manifest, add Kubernetes node selectors to ensure that GKE schedules your TPU workload on the TPU machine type and TPU topology you defined:
nodeSelector:
  cloud.google.com/gke-tpu-accelerator: TPU_ACCELERATOR
  cloud.google.com/gke-tpu-topology: TPU_TOPOLOGY
Replace the following:
- TPU_ACCELERATOR: The name of the TPU accelerator.
- TPU_TOPOLOGY: The physical topology for the TPU slice. The format of the topology depends on the TPU version. To learn more, see Plan TPUs in GKE.
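After the node pool is up, you can confirm that its nodes carry the labels that your selectors target. This sketch uses example label values (tpu-v5-lite-podslice and 2x4 match the single-host example later on this page):

# List TPU slice nodes that match the accelerator and topology labels.
kubectl get nodes \
    -l cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice,cloud.google.com/gke-tpu-topology=2x4 \
    -o wide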
After you complete the workload preparation, you can run a Job that uses TPUs.
The following sections show examples on how to run a Job that performs simple computation with TPUs.
Example 1: Run a workload that displays the number of available TPU chips in a TPU slice node pool
The following workload returns the number of TPU chips across all of the nodes in a multi-host TPU slice. To create a multi-host slice, the workload has the following parameters:
- TPU version: TPU v4
- Topology: 2x2x4
This version and topology selection result in a multi-host slice.
- Save the following manifest as available-chips-multihost.yaml:

apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None
  selector:
    job-name: tpu-available-chips
---
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-available-chips
spec:
  backoffLimit: 0
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      subdomain: headless-svc
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x4
      containers:
      - name: tpu-job
        image: python:3.10
        ports:
        - containerPort: 8471 # Default port using which TPU VMs communicate
        - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
        securityContext:
          privileged: true
        command:
        - bash
        - -c
        - |
          pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
          python -c 'import jax; print("TPU cores:", jax.device_count())'
        resources:
          requests:
            cpu: 10
            memory: 500Gi
            google.com/tpu: 4
          limits:
            cpu: 10
            memory: 500Gi
            google.com/tpu: 4
- Deploy the manifest:
kubectl create -f available-chips-multihost.yaml
GKE runs a TPU v4 slice with four VMs (multi-host TPU slice). The slice has 16 interconnected TPU chips.
- Verify that the Job created four Pods:
kubectl get pods
The output is similar to the following:
NAME                       READY   STATUS      RESTARTS   AGE
tpu-job-podslice-0-5cd8r   0/1     Completed   0          97s
tpu-job-podslice-1-lqqxt   0/1     Completed   0          97s
tpu-job-podslice-2-f6kwh   0/1     Completed   0          97s
tpu-job-podslice-3-m8b5c   0/1     Completed   0          97s
- Get the logs of one of the Pods:
kubectl logs POD_NAME
Replace POD_NAME with the name of one of the created Pods. For example, tpu-job-podslice-0-5cd8r.
The output is similar to the following:
TPU cores: 16
Example 2: Run a workload that displays the number of available TPU chips in the TPU slice
The following workload is a static Pod that displays the number of TPU chips that are attached to a specific node. To create a single-host node, the workload has the following parameters:
- TPU version: TPU v5e
- Topology: 2x4
This version and topology selection result in a single-host slice.
- Save the following manifest as available-chips-singlehost.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: tpu-job-jax-v5
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 2x4
  containers:
  - name: tpu-job
    image: python:3.10
    ports:
    - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
    securityContext:
      privileged: true
    command:
    - bash
    - -c
    - |
      pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
      python -c 'import jax; print("Total TPU chips:", jax.device_count())'
    resources:
      requests:
        google.com/tpu: 8
      limits:
        google.com/tpu: 8
- Deploy the manifest:
kubectl create -f available-chips-singlehost.yaml
GKE provisions a single-host TPU slice node that uses TPU v5e. The node has eight TPU chips (a single-host TPU slice).
- Get the logs of the Pod:
kubectl logs tpu-job-jax-v5
The output is similar to the following:
Total TPU chips: 8
Upgrade node pools using accelerators (GPUs and TPUs)
GKE automatically upgrades Standard clusters, including node pools. You can also manually upgrade node pools if you want your nodes on a later version sooner. To control how upgrades work for your cluster, use release channels, maintenance windows and exclusions, and rollout sequencing.
You can also configure a node upgrade strategy for your node pool, such as surge upgrades or blue-green upgrades. By configuring these strategies, you can ensure that the node pools are upgraded in a way that achieves the optimal balance between speed and disruption for your environment. For multi-host TPU slice node pools, instead of using the configured node upgrade strategy, GKE atomically recreates the entire node pool in a single step. To learn more, see the definition of atomicity in Terminology related to TPU in GKE.
Using a node upgrade strategy temporarily requires GKE to provision additional resources, depending on the configuration. If Google Cloud has limited capacity for your node pool's resources—for example, you're seeing resource availability errors when trying to create more nodes with GPUs or TPUs—see Upgrade in a resource-constrained environment.
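For example, you could configure a surge upgrade on an existing single-host TPU slice node pool with a sketch like the following; the names and surge values are placeholders, and the strategy applies only to single-host pools because multi-host TPU slice node pools are recreated atomically:

# Allow one extra surge node and no unavailable nodes during upgrades.
gcloud container node-pools update tpu-v5e-pool \
    --cluster=tpu-demo-cluster \
    --location=us-west4 \
    --max-surge-upgrade=1 \
    --max-unavailable-upgrade=0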
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this guide, consider deleting the TPU slice node pools that no longer have scheduled workloads. If the running workloads must be gracefully terminated, use kubectl drain to clean up the workloads before you delete the node, as in the sketch that follows.
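A minimal drain sketch; NODE_NAME is a placeholder, and the flags shown are the usual ones for nodes that run DaemonSets and emptyDir volumes:

# Cordon and drain a TPU slice node before deleting the node pool.
kubectl drain NODE_NAME \
    --ignore-daemonsets \
    --delete-emptydir-data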
Delete a TPU slice node pool:
gcloud container node-pools delete POOL_NAME \
    --location=LOCATION \
    --cluster=CLUSTER_NAME
Replace the following:
- POOL_NAME: The name of the node pool.
- CLUSTER_NAME: The name of the cluster.
- LOCATION: The compute location of the cluster.
Additional configurations
The following sections describe the additional configurations you can apply to your TPU workloads.
Manage collection scheduling
In TPU Trillium, you can use collection scheduling to group TPU slice nodes. Grouping these TPU slice nodes makes it easier to adjust the number of replicas to meet the workload demand. Google Cloud controls software updates to ensure that sufficient slices within the collection are always available to serve traffic.
Use the following tasks to manage single-host TPU slice node pools.
To check if a single-host TPU slice pool has collection scheduling enabled, run the following command:
gcloud container node-pools describe NODE_POOL_NAME \
    --cluster CLUSTER_NAME \
    --project PROJECT_NAME \
    --format="json" | jq -r '.config.labels["cloud.google.com/gke-workload-type"]'
The output is similar to the following:
gke-workload-type: HIGH_AVAILABILITY
If the single-host TPU slice pool is part of a collection, the output has the cloud.google.com/gke-workload-type: HIGH_AVAILABILITY label.
To scale up the collection, resize the node pool manually or automatically with node auto-provisioning, as in the resize sketch at the end of this section.
To scale down the collection, delete the node pool.
To delete the collection, remove all of the attached node pools. You can delete the node pool or delete the cluster. Deleting the cluster removes all of the collections in it.
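A resize sketch with placeholder names, assuming the collection is backed by a manually scaled single-host TPU slice node pool:

# Manually scale the single-host TPU slice node pool to the target node count.
gcloud container clusters resize tpu-demo-cluster \
    --location=us-west4 \
    --node-pool=tpu-v5e-pool \
    --num-nodes=6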
Multislice
You can aggregate smaller slices together in a Multislice to handle larger training workloads. For more information, see Multislice TPUs in GKE.
Migrate your TPU reservation
If you have existing TPU reservations, you must first migrate them to a new Compute Engine-based reservation. You can also create a new Compute Engine-based reservation, in which case no migration is needed. To learn how to migrate your TPU reservations, see TPU reservation.
Logging
Logs emitted by containers running on GKE nodes, including TPU VMs, are collected by the GKE logging agent, sent to Logging, and are visible in Logging.
Use GKE node auto-provisioning
You can configure GKE to automatically create and delete node pools to meet the resource demands of your TPU workloads. For more information, see Configuring Cloud TPUs.
Provision TPUs by using custom compute classes
You can also configure GKE to request TPUs during scaling operations that create new nodes by using custom compute classes.
You can specify TPU configuration options in your custom compute class specification. When a GKE workload uses that custom compute class, GKE attempts to provision TPUs that use your specified configuration when scaling up.
To provision TPUs with a custom compute class, do the following:
Ensure that your cluster has an available custom compute class that selects TPUs. To learn how to specify TPUs in custom compute classes, see TPU rules.
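For example, assuming custom compute classes are exposed in your cluster through the cloud.google.com ComputeClass resource (an assumption about the resource name in your GKE version), you could verify that the class exists:

# List the custom compute classes defined in the cluster.
kubectl get computeclasses.cloud.google.com

# Inspect a specific class (placeholder name) to see its TPU rules.
kubectl describe computeclasses.cloud.google.com TPU_CLASS_NAME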
Save the following manifest as tpu-job.yaml:

apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None
  selector:
    job-name: tpu-job
---
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-job
spec:
  backoffLimit: 0
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      subdomain: headless-svc
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/compute-class: TPU_CLASS_NAME
      containers:
      - name: tpu-job
        image: python:3.10
        ports:
        - containerPort: 8471 # Default port using which TPU VMs communicate
        - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
        command:
        - bash
        - -c
        - |
          pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
          python -c 'import jax; print("TPU cores:", jax.device_count())'
        resources:
          requests:
            cpu: 10
            memory: 500Gi
            google.com/tpu: NUMBER_OF_CHIPS
          limits:
            cpu: 10
            memory: 500Gi
            google.com/tpu: NUMBER_OF_CHIPS
Replace the following:
- TPU_CLASS_NAME: the name of the existing custom compute class that specifies TPUs.
- NUMBER_OF_CHIPS: the number of TPU chips for the container to use. Must be the same value for limits and requests, equal to the value in the tpu.count field in the selected custom compute class.
Deploy the Job:
kubectl create -f tpu-job.yaml
When you create this Job, GKE automatically does the following:
- Provisions nodes to run the Pods. Depending on the TPU type, topology, and resource requests that you specified, these nodes are either single-host slices or multi-host slices. Depending on the availability of TPU resources in the top priority, GKE might fall back to lower priorities to maximize obtainability.
- Adds taints to the nodes and tolerations to the Pods to prevent any of your other workloads from running on the same nodes as TPU workloads.
To learn more, see About custom compute classes.
TPU slice node auto repair
If a TPU slice node in a multi-host TPU slice node pool is unhealthy, the entire node pool is recreated. In contrast, in a single-host TPU slice node pool, only the unhealthy TPU node is auto-repaired.
Conditions that result in unhealthy TPU slice nodes include the following:
- Any TPU slice node with common node conditions.
- Any TPU slice node with an unallocatable TPU count larger than zero.
- Any VM instance in a TPU slice that is stopped (due to preemption) or is terminated.
- Node maintenance: If any TPU slice node within a multi-host TPU slice node pool goes down for host maintenance, GKE recreates the entire TPU slice node pool.
You can see the repair status (including the failure reason) in the operation history. If the failure is caused by insufficient quota, contact your Google Cloud account representative to increase the corresponding quota.
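One way to review recent repair activity from the command line is to list cluster operations; this is a sketch, and the operationType filter value is an assumption about how GKE labels node auto repair operations:

# List recent node auto repair operations in the cluster's location.
gcloud container operations list \
    --location=us-west4 \
    --filter="operationType=AUTO_REPAIR_NODES" \
    --limit=10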
Configure TPU slice node graceful termination
In GKE clusters with the control plane running 1.29.1-gke.1425000 or later, TPU slice nodes support SIGTERM signals that alert the node of an imminent shutdown. The imminent shutdown notification is configurable for up to five minutes in TPU nodes.
To configure GKE to terminate your workloads gracefully within this notification timeframe, follow the steps in Manage GKE node disruption for GPUs and TPUs.
Run containers without privileged mode
Containers running on nodes in GKE version 1.28 or later don't need privileged mode enabled to access TPUs. Nodes in GKE versions earlier than 1.28 require privileged mode.
If your TPU slice node runs a version earlier than 1.28, read the following section:
A container running on a VM in a TPU slice needs access to higher limits on locked memory so the driver can communicate with the TPU chips over direct memory access (DMA). To enable this, you must configure a higher ulimit. If you want to reduce the permission scope on your container, complete the following steps:
- Edit the securityContext to include the following fields:

securityContext:
  capabilities:
    add: ["SYS_RESOURCE"]
- Increase ulimit by running the following command inside the container before setting up your workloads to use TPU resources:

ulimit -l 68719476736
For TPU v5e, running containers without privileged mode is available in clusters in version 1.27.4-gke.900 and later.
Observability and metrics
Dashboard
In the Kubernetes Clusters page in the Google Cloud console, the Observability tab displays the TPU observability metrics. For more information, see GKE observability metrics.
The TPU dashboard is populated only if you have system metrics enabled in your GKE cluster.
Runtime metrics
In GKE version 1.27.4-gke.900 or later, TPU workloads that use JAX version 0.4.14 or later and specify containerPort: 8431 export TPU utilization metrics as GKE system metrics.
The following metrics are available in Cloud Monitoring to monitor your TPU workload's runtime performance:
- Duty cycle: Percentage of time over the past sampling period (60 seconds) during which the TensorCores were actively processing on a TPU chip. Larger percentage means better TPU utilization.
- Memory used: Amount of accelerator memory allocated in bytes. Sampled every 60 seconds.
- Memory total: Total accelerator memory in bytes. Sampled every 60 seconds.
These metrics are located in the Kubernetes node (k8s_node) and Kubernetes container (k8s_container) schemas.
Kubernetes container:
kubernetes.io/container/accelerator/duty_cycle
kubernetes.io/container/accelerator/memory_used
kubernetes.io/container/accelerator/memory_total
Kubernetes node:
kubernetes.io/node/accelerator/duty_cycle
kubernetes.io/node/accelerator/memory_used
kubernetes.io/node/accelerator/memory_total
Host metrics
In GKE version 1.28.1-gke.1066000 or later, VMs in a TPU slice export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU host's performance:
- TensorCore utilization: Current percentage of the TensorCore that is utilized. The TensorCore value equals the sum of the matrix-multiply units (MXUs) plus the vector unit. The TensorCore utilization value is the number of TensorCore operations performed over the past sample period (60 seconds) divided by the supported number of TensorCore operations over the same period. A larger value means better utilization.
- Memory bandwidth utilization: Current percentage of the accelerator memory bandwidth that is being used. Computed by dividing the memory bandwidth used over a sample period (60s) by the maximum supported bandwidth over the same sample period.
These metrics are located in the Kubernetes node (k8s_node) and Kubernetes container (k8s_container) schemas.
Kubernetes container:
kubernetes.io/container/accelerator/tensorcore_utilization
kubernetes.io/container/accelerator/memory_bandwidth_utilization
Kubernetes node:
kubernetes.io/node/accelerator/tensorcore_utilization
kubernetes.io/node/accelerator/memory_bandwidth_utilization
For more information, see Kubernetes metrics and GKE system metrics.
Known issues
- Cluster autoscaler might incorrectly calculate capacity for new TPU slice nodes before those nodes report available TPUs. Cluster autoscaler might then perform additional scale-up operations and, as a result, create more nodes than needed. Cluster autoscaler scales down the additional nodes, if they are not needed, during a regular scale-down operation.
- Cluster autoscaler cancels the scale-up of TPU slice node pools that remain in a waiting status for more than 10 hours. Cluster autoscaler retries such scale-up operations later. This behavior might reduce TPU obtainability for customers who don't use reservations.
- Non-TPU workloads that have a toleration for the TPU taint can prevent scale-down of the node pool if they are recreated during draining of the TPU slice node pool.
- The memory bandwidth utilization metric is not available for v5e TPUs.
What's next
- Learn more about setting up Ray on GKE with TPUs
- Build large-scale machine learning on Cloud TPUs with GKE
- Serve Large Language Models with KubeRay on TPUs
- Troubleshoot TPUs in GKE