Serve an LLM using TPUs on GKE with KubeRay


This tutorial shows you how to serve a large language model (LLM) on Google Kubernetes Engine (GKE) with Tensor Processing Units (TPUs), using the Ray Operator add-on and the vLLM serving framework.

In this tutorial, you serve an LLM model (Llama 3 8B Instruct, Mistral 7B Instruct, or Llama 3.1 70B) on TPU v5e or TPU Trillium (v6e).

This guide is intended for generative AI customers, new and existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve models with Ray and vLLM on TPUs.

Background

This section describes the key technologies used in this guide.

GKE managed Kubernetes service

Google Cloud offers a wide range of services, including GKE, which is well suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the necessary infrastructure, including scalable resources, distributed computing, and high-performance networking, to handle the computational demands of LLMs.

To learn more about key Kubernetes concepts, see Start learning about Kubernetes. To learn more about GKE and how it helps you automate, manage, and scale Kubernetes, see the GKE overview.

Ray Operator

The Ray Operator add-on on GKE provides an end-to-end AI/ML platform for serving, training, and fine-tuning machine learning workloads. In this tutorial, you use Ray Serve, a framework in Ray, to serve popular LLMs from Hugging Face.

TPUs

TPUs are custom-developed application-specific integrated circuits (ASICs) from Google that are used to accelerate machine learning and AI models built with frameworks such as TensorFlow, PyTorch, and JAX.

This tutorial covers serving LLM models on TPU v5e or TPU Trillium (v6e) nodes, with the TPU topology configured based on each model's requirements for serving prompts with low latency.

vLLM

vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on TPUs, with features such as:

  • An optimized transformer implementation with PagedAttention
  • Continuous batching to improve the overall serving throughput
  • Tensor parallelism and distributed serving across multiple GPUs

To learn more, see the vLLM documentation.

Objectives

This tutorial covers the following steps:

  1. Create a GKE cluster with a TPU node pool.
  2. Deploy a RayCluster custom resource with a single-host TPU node. GKE deploys the RayCluster custom resource as Kubernetes Pods.
  3. Serve an LLM.
  4. Interact with the model.

You can optionally configure the following model serving resources and techniques that the Ray Serve framework supports:

  • Deploy a RayService custom resource.
  • Compose multiple models with model composition.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
  • Create a Hugging Face account, if you don't already have one.
  • Make sure that you have a Hugging Face token.
  • Make sure that you have access to the Hugging Face model that you want to use. You usually get this access by signing an agreement and requesting access from the model owner on the Hugging Face model page.
  • Make sure that you have the following IAM roles (a quick way to confirm them is shown in the sketch after this list):
    • roles/container.admin
    • roles/iam.serviceAccountAdmin
    • roles/container.clusterAdmin
    • roles/artifactregistry.writer
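
To confirm which roles are granted to your account on the project, you can inspect the project's IAM policy. The following command is a minimal sketch added here for convenience; user:you@example.com is a hypothetical placeholder that you replace with your own principal:

# List the roles granted to a specific principal on the current project.
# Replace user:you@example.com (hypothetical placeholder) with your account.
gcloud projects get-iam-policy $(gcloud config get project) \
    --flatten="bindings[].members" \
    --filter="bindings.members:user:you@example.com" \
    --format="value(bindings.role)"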

Prepare the environment

  1. Make sure that your Google Cloud project has sufficient quota for a single-host TPU v5e or a single-host TPU Trillium (v6e). To manage your quota, see TPU quotas.

  2. In the Google Cloud console, launch a Cloud Shell instance:
    Open Cloud Shell

  3. Clone the sample repository:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples
    
  4. Go to the working directory:

    cd ai-ml/gke-ray/rayserve/llm
    
  5. Set the default environment variables for cluster creation:

    Llama-3-8B-Instruct

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Make sure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.

    Mistral-7B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
    export TOKENIZER_MODE=mistral
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Make sure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.

    Llama 3.1 70B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Llama-3.1-70B"
    export MAX_MODEL_LEN=8192
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Make sure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.
  6. Pull the vLLM container image:

    sudo usermod -aG docker ${USER}
    newgrp docker
    docker pull ${VLLM_IMAGE}
    
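
    Optionally, confirm that the image is now available locally. This quick check is an addition to the tutorial steps, not part of the original guide:

    # List the locally cached vLLM TPU image that was just pulled.
    docker images ${VLLM_IMAGE}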

Create a cluster

You can serve an LLM on TPUs with Ray in a GKE Autopilot or Standard cluster by using the Ray Operator add-on.

Best practice:

Use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Use Cloud Shell to create an Autopilot or Standard cluster:

Autopilot

  1. Create a GKE Autopilot cluster with the Ray Operator add-on enabled:

    gcloud container clusters create-auto ${CLUSTER_NAME}  \
        --enable-ray-operator \
        --release-channel=rapid \
        --location=${COMPUTE_REGION}
    

Standard

  1. Create a Standard cluster with the Ray Operator add-on enabled:

    gcloud container clusters create ${CLUSTER_NAME} \
        --release-channel=rapid \
        --location=${COMPUTE_ZONE} \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --machine-type="n1-standard-4" \
        --addons=RayOperator,GcsFuseCsiDriver
    
  2. Create a single-host TPU node pool:

    Llama-3-8B-Instruct

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE creates a TPU v5e node pool with the ct5lp-hightpu-8t machine type.

    Mistral-7B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE creates a TPU v5e node pool with the ct5lp-hightpu-8t machine type.

    Llama 3.1 70B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct6e-standard-8t \
        --num-nodes=1
    

    GKE creates a TPU v6e node pool with the ct6e-standard-8t machine type.
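
    To confirm that the node pool was created and its nodes are ready, you can list the node pools in the cluster. This check is an optional addition to the tutorial:

    # List node pools in the cluster; the tpu-1 pool should appear with status RUNNING.
    gcloud container node-pools list \
        --cluster=${CLUSTER_NAME} \
        --location=${COMPUTE_ZONE}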

Configure kubectl to communicate with your cluster

To configure kubectl to communicate with your cluster, run the following command:

Autopilot

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

Standard

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}
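
As an optional sanity check that isn't part of the original steps, verify that kubectl can reach the cluster:

# Confirm connectivity to the cluster and that the nodes are in the Ready state.
kubectl get nodes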

Create a Kubernetes Secret for Hugging Face credentials

To create a Kubernetes Secret that contains the Hugging Face token, run the following command:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -
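
Optionally, verify that the Secret exists in the target namespace; this check is an addition for convenience, not part of the original tutorial:

# Confirm that the hf-secret Secret was created.
kubectl --namespace ${NAMESPACE} get secret hf-secret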

Create a Cloud Storage bucket

To reduce the vLLM deployment startup time and minimize the disk space required per node, use the Cloud Storage FUSE CSI driver to mount the downloaded model and the compilation cache to the Ray nodes.

In Cloud Shell, run the following command:

gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access

This command creates a Cloud Storage bucket to store the model files that you download from Hugging Face.

Set up a Kubernetes ServiceAccount to access the bucket

  1. Create the Kubernetes ServiceAccount:

    kubectl create serviceaccount ${KSA_NAME} \
        --namespace ${NAMESPACE}
    
  2. Grant the Kubernetes ServiceAccount read-write access to the Cloud Storage bucket:

    gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
        --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
        --role "roles/storage.objectUser"
    

    GKE creates the following resources for the LLM:

    1. A Cloud Storage bucket to store the downloaded model and the compilation cache. The Cloud Storage FUSE CSI driver reads the content of the bucket.
    2. A volume with file caching enabled and the parallel download feature of Cloud Storage FUSE.
    Best practice:

    Use a file cache backed by tmpfs or Hyperdisk / Persistent Disk, depending on the expected size of the model contents, such as the weight files. In this tutorial, you use the Cloud Storage FUSE file cache backed by RAM.
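
    To confirm that the Workload Identity Federation principal from the previous step is bound to the bucket, you can inspect the bucket's IAM policy. This optional check is an addition to the tutorial:

    # The ServiceAccount principal should be listed with roles/storage.objectUser.
    gcloud storage buckets get-iam-policy gs://${GSBUCKET}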

Deploy a RayCluster custom resource

Deploy a RayCluster custom resource, which typically consists of one system Pod and multiple worker Pods.

Llama-3-8B-Instruct

Create the RayCluster custom resource to deploy the Llama 3 8B instruction-tuned model by completing the following steps:

  1. Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    The envsubst command replaces the environment variables in the manifest.

GKE creates the RayCluster custom resource with a workergroup containing a single-host TPU v5e in a 2x4 topology.
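
If you want to review the fully rendered manifest before applying it, the following optional sketch performs the same substitution but validates the result client-side instead of creating it:

# Render the manifest with environment variables substituted and validate it
# locally without creating any resources in the cluster.
envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml \
    | kubectl --namespace ${NAMESPACE} apply --dry-run=client -f -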

Mistral-7B

Create the RayCluster custom resource to deploy the Mistral-7B model by completing the following steps:

  1. Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    The envsubst command replaces the environment variables in the manifest.

GKE creates the RayCluster custom resource with a workergroup containing a single-host TPU v5e in a 2x4 topology.

Llama 3.1 70B

To deploy the Llama 3.1 70B model, complete the following steps to create the RayCluster custom resource:

  1. Inspect the ray-cluster.tpu-v6e-singlehost.yaml manifest:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    The envsubst command replaces the environment variables in the manifest.

GKE creates the RayCluster custom resource with a workergroup containing a single-host TPU v6e in a 2x4 topology.

Connect to the RayCluster custom resource

After the RayCluster custom resource is created, you can connect to the RayCluster resource and start serving the model.

  1. Verify that GKE created the RayCluster Service:

    kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
        --output wide
    

    The output is similar to the following:

    NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP      HEAD SERVICE IP
    vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###  ###.###.###.###
    

    Wait until the STATUS is ready and the HEAD POD IP and HEAD SERVICE IP columns contain IP addresses.

  2. Establish port-forwarding sessions with the Ray head node:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 10001:10001"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null &
    
  3. Verify that the Ray client can connect to the remote RayCluster custom resource:

    docker run --net=host -it ${VLLM_IMAGE} \
    ray list nodes --address http://localhost:8265
    

    The output is similar to the following:

    ======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
    Stats:
    ------------------------------
    Total: 2
    
    Table:
    ------------------------------
        NODE_ID    NODE_IP          IS_HEAD_NODE  STATE    STATE_MESSAGE    NODE_NAME          RESOURCES_TOTAL                   LABELS
    0  XXXXXXXXXX  ###.###.###.###  True          ALIVE                     ###.###.###.###    CPU: 2.0                          ray.io/node_id: XXXXXXXXXX
                                                                                               memory: #.### GiB
                                                                                               node:###.###.###.###: 1.0
                                                                                               node:__internal_head__: 1.0
                                                                                               object_store_memory: #.### GiB
    1  XXXXXXXXXX  ###.###.###.###  False         ALIVE                     ###.###.###.###    CPU: 100.0                       ray.io/node_id: XXXXXXXXXX
                                                                                               TPU: 8.0
                                                                                               TPU-v#e-8-head: 1.0
                                                                                               accelerator_type:TPU-V#E: 1.0
                                                                                               memory: ###.### GiB
                                                                                               node:###.###.###.###: 1.0
                                                                                               object_store_memory: ##.### GiB
                                                                                               tpu-group-0: 1.0
    

Deploy the model with vLLM

To deploy the model with vLLM:

Llama-3-8B-Instruct

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'

Mistral-7B

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'

Llama 3.1 70B

docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Meta-Llama-3.1-70B"}}'

View the Ray dashboard

You can view your Ray Serve deployment and relevant logs from the Ray dashboard.

  1. Click the Web Preview button at the top right of the Cloud Shell taskbar.
  2. Click Change port and set the port number to 8265.
  3. Click Change and Preview.
  4. In the Ray dashboard, click the Serve tab.

After the status of the Serve deployment is HEALTHY, the model is ready to start processing inputs.
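
If you prefer a command-line check instead of the dashboard, the Serve application status is also exposed through the dashboard port that you forwarded earlier. The endpoint below is an assumption based on the Ray Serve REST API and is not part of the original tutorial:

# Query the Ray Serve REST API through the forwarded dashboard port (8265)
# and print the application status as JSON.
curl -s http://localhost:8265/api/serve/applications/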

Serve the model

This guide focuses on models that support text generation, a technique that creates text content from a prompt.

Llama-3-8B-Instruct

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Mistral-7B

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Llama 3.1 70B

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Additional configuration

You can optionally configure the following model serving resources and techniques that the Ray Serve framework supports:

Deploy a RayService

You can deploy the same models from this tutorial by using a RayService custom resource.

  1. Delete the RayCluster custom resource that you created in this tutorial:

    kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu
    
  2. Create a RayService custom resource to deploy a model:

    Llama-3-8B-Instruct

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      The envsubst command replaces the environment variables in the manifest.

      GKE creates a RayService with a workergroup containing a single-host TPU v5e in a 2x4 topology.

    Mistral-7B

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      The envsubst command replaces the environment variables in the manifest.

      GKE creates a RayService with a workergroup containing a single-host TPU v5e in a 2x4 topology.

    Llama 3.1 70B

    1. Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      The envsubst command replaces the environment variables in the manifest.

    GKE creates the RayCluster custom resource, deploys the Ray Serve application on it, and then creates the resulting RayService custom resource.

  3. Verify the status of the RayService resource:

    kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu
    

    Wait for the Service status to change to Running:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          1
    
  4. Retrieve the name of the RayCluster head service:

    SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
        --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})
    
  5. Establish a port-forwarding session with the Ray head node to view the Ray dashboard:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    
  6. View the Ray dashboard.

  7. Serve the model.

  8. Clean up the RayService resource:

    kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu
    

Compose multiple models with model composition

Model composition is a technique for composing multiple models into a single application.

In this section, you use a GKE cluster to compose two models, Llama 3 8B IT and Gemma 7B IT, into a single application:

  • The first model is the assistant model that answers the questions asked in the prompt.
  • The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is the summarized version of the response from the assistant model.
  1. To get access to the Gemma model, complete the following steps:

    1. Sign in to the Kaggle platform, sign the license consent agreement, and get a Kaggle API token. In this tutorial, you use a Kubernetes Secret to store the Kaggle credentials.
    2. Go to the model consent page on Kaggle.com.
    3. Sign in to Kaggle, if you haven't done so already.
    4. Click Request Access.
    5. In the Choose Account for Consent section, select Verify via Kaggle Account to use your Kaggle account for consent.
    6. Accept the model Terms and Conditions.
  2. Set up the environment:

    export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
    export SUMMARIZER_MODEL_ID=google/gemma-7b-it
    
  3. For Standard clusters, create an additional single-host TPU node pool:

    gcloud container node-pools create tpu-2 \
      --location=${COMPUTE_ZONE} \
      --cluster=${CLUSTER_NAME} \
      --machine-type=MACHINE_TYPE \
      --num-nodes=1
    

    Replace MACHINE_TYPE with one of the following machine types:

    • ct5lp-hightpu-8t to provision TPU v5e.
    • ct6e-standard-8t to provision TPU v6e.

    Autopilot clusters automatically provision the required nodes.

  4. Deploy the RayService resource based on the TPU version that you want to use:

    TPU v5e

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
          - name: llm
            route_prefix: /
            import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
            deployments:
            - name: MultiModelDeployment
              num_replicas: 1
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
              env_vars:
                ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                TPU_CHIPS: "16"
                TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

    TPU v6e

    1. Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
          - name: llm
            route_prefix: /
            import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
            deployments:
            - name: MultiModelDeployment
              num_replicas: 1
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
              env_vars:
                ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                TPU_CHIPS: "16"
                TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      
  5. Wait for the status of the RayService resource to change to Running:

    kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu
    

    The output is similar to the following:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          2
    

    In this output, the Running status indicates that the RayService resource is ready.

  6. Verify that GKE created the Service for the Ray Serve application:

    kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc
    

    The output is similar to the following:

    NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
    vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###
    
  7. Establish port-forwarding sessions with the Ray head node:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null &
    
  8. Send a request to the model:

    curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'
    

    The output is similar to the following:

      {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]}
    
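
    The response body is a JSON object with a text array, as shown in the preceding output. If you only want the generated string, the following optional variation extracts it; it assumes jq is available, as it is in Cloud Shell:

    # Send the same request and print only the first element of the "text" array.
    curl -s -X POST http://localhost:8000/ \
        -H "Content-Type: application/json" \
        -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}' \
        | jq -r '.text[0]'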

Build and deploy the TPU image

This tutorial uses hosted TPU images from vLLM. vLLM provides a Dockerfile.tpu that builds vLLM on top of the required PyTorch XLA image, which includes the TPU dependencies. However, you can also build and deploy your own TPU image for finer-grained control over the contents of your Docker image.

  1. Create a Docker repository to store the container images for this guide:

    gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
    gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev
    
  2. Clone the vLLM repository:

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    
  3. Build the image:

    docker build -f ./docker/Dockerfile.tpu . -t vllm-tpu
    
  4. Tag the TPU image with your Artifact Registry name:

    export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
    docker tag vllm-tpu ${VLLM_IMAGE}
    

    Replace TAG with the name of the tag that you want to define. If you don't specify a tag, Docker applies the default latest tag.

  5. Push the image to Artifact Registry:

    docker push ${VLLM_IMAGE}
    
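
    To confirm that the image was pushed, you can list the images stored in the repository. This step is an optional addition to the tutorial:

    # List the container images in the vllm-tpu Artifact Registry repository.
    gcloud artifacts docker images list \
        ${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu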

Delete the individual resources

If you used an existing project and you don't want to delete it, you can delete the individual resources.

  1. Delete the RayCluster custom resource:

    kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu
    
  2. Delete the Cloud Storage bucket:

    gcloud storage rm -r gs://${GSBUCKET}
    
  3. Delete the Artifact Registry repository:

    gcloud artifacts repositories delete vllm-tpu \
        --location=${COMPUTE_REGION}
    
  4. Delete the cluster:

    gcloud container clusters delete ${CLUSTER_NAME} \
        --location=LOCATION
    

    Replace LOCATION with either of the following environment variables:

    • COMPUTE_REGION, if you created an Autopilot cluster.
    • COMPUTE_ZONE, if you created a Standard cluster.

Delete the project

If you deployed the tutorial in a new Google Cloud project and you no longer need the project, delete it by completing the following steps:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next