A typical use of C++ in Google Cloud is to perform parallel computations or data analysis and then store the results in some kind of database. In this guide we will build such an application. We will use "scanning" GCS as a simplified example of "analyzing" some data, Cloud Spanner as the database to store the results, and Cloud Run, a managed platform for containerized applications, to deploy the application.
Google Cloud Storage (GCS) buckets can contain thousands, millions, and even billions of objects. GCS can quickly find an object given its name, or list objects with names in a given range, but some applications need more advanced lookups. For example, one may be interested in finding all the objects within a certain size range, or with a given object type.
In this guide, we will create and deploy an application to scan all the objects in a bucket, and store the full metadata information of each object in a Cloud Spanner instance. Once the information is in a Cloud Spanner table, one can use normal SQL statements to search for objects.
The basic structure of this application is shown below. We will create a deployment that scans the object metadata in Cloud Storage. To schedule work for this deployment we will use Cloud Pub/Sub as a job queue. Initially the user posts an indexing request to Cloud Pub/Sub, asking to index all the objects with a given "prefix" (often thought of as a folder) in a GCS bucket. If a request fails or times out, Cloud Pub/Sub will automatically resend it to a new instance.
If the work can be split by breaking the folder into smaller subfolders, the indexing job will do so. It simply posts a request to index each subfolder to the same queue, though that request may be handled by a different instance as the job scales up. As the number of these requests grows, Cloud Run will automatically scale up the indexing deployment. We do not need to worry about scaling up the job, or scaling it down at the end. In fact, Cloud Run can "scale down to zero", so we do not even need to worry about shutting it down.
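The splitting step can be illustrated with a small shell sketch. This is not the application's C++ code. It only assumes that a "subfolder" is the part of an object name up to the next `/` after the prefix, and the function name `list_subprefixes` is hypothetical.

```shell
# Hypothetical helper: read object names on stdin and print the distinct
# immediate "subfolders" found below the prefix given as $1.
list_subprefixes() {
  awk -v prefix="$1" '
    index($0, prefix) == 1 {                  # the name starts with the prefix
      rest = substr($0, length(prefix) + 1)   # the part after the prefix
      slash = index(rest, "/")                # position of the next separator
      if (slash > 0) print prefix substr(rest, 1, slash)
    }' | sort -u
}
```

For example, feeding it the names `LC08/01/006/001/a.txt`, `LC08/01/006/002/b.txt`, and `LC08/01/006/index.html` with the prefix `LC08/01/006/` prints `LC08/01/006/001/` and `LC08/01/006/002/`. Each of those lines would become a separate indexing request, while objects directly under the prefix would be handled by the current request.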
This example assumes that you have an existing GCP (Google Cloud Platform) project. The project must have billing enabled, as some of the services used in this example require it. If needed, consult:
- the GCP quickstarts to set up a GCP project
- the Cloud Run quickstarts to set up Cloud Run in your project
Use your workstation, a GCE instance, or the Cloud Shell to get a command-line prompt. If needed, log in to GCP using:
gcloud auth login
Throughout the example we will use `GOOGLE_CLOUD_PROJECT` as an environment variable containing the ID of the project.
export GOOGLE_CLOUD_PROJECT=[PROJECT ID]
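Every later command depends on this variable, so a script that automates these steps may want to check it first. A minimal sketch, where `require_project` is a hypothetical helper name:

```shell
# Hypothetical guard: fail early when GOOGLE_CLOUD_PROJECT is unset or empty.
require_project() {
  if [ -z "${GOOGLE_CLOUD_PROJECT:-}" ]; then
    echo "GOOGLE_CLOUD_PROJECT is not set" >&2
    return 1
  fi
}
```

Calling `require_project || exit 1` at the top of a script stops it before any `gcloud` command runs against the wrong project.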
⚠️ This guide uses Cloud Spanner, a service that is billed by the hour even if you stop using it. The charges can reach hundreds or thousands of dollars per month if you configure a large Cloud Spanner instance. Consult the Pricing Calculator for details. Please remember to delete any Cloud Spanner resources once you no longer need them.
We will issue a number of commands using the Google Cloud SDK, a command-line tool to interact with Google Cloud services. Specifying the project (via the `--project=$GOOGLE_CLOUD_PROJECT` flag) on each invocation of this tool quickly becomes tedious. We start by configuring the default project:
gcloud config set project $GOOGLE_CLOUD_PROJECT
# Output: Updated property [core/project].
Some services are not enabled by default when you create a Google Cloud project. We enable all the services needed in this guide using:
gcloud services enable run.googleapis.com
gcloud services enable cloudbuild.googleapis.com
gcloud services enable containerregistry.googleapis.com
gcloud services enable container.googleapis.com
gcloud services enable pubsub.googleapis.com
gcloud services enable spanner.googleapis.com
# Output: nothing if the services are already enabled.
# For services that are not yet enabled, something like:
# Operation "operations/...." finished successfully.
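The `gcloud services enable` command also accepts several service names at once, so the same set of services can be enabled with a single invocation. A sketch, wrapped in a hypothetical function named `enable_guide_services`:

```shell
# Enable every service used in this guide with one gcloud invocation.
# The function name is hypothetical; the service names are the ones above.
enable_guide_services() {
  gcloud services enable \
    run.googleapis.com \
    cloudbuild.googleapis.com \
    containerregistry.googleapis.com \
    container.googleapis.com \
    pubsub.googleapis.com \
    spanner.googleapis.com
}
```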
So far, we have not created any C++ code. It is time to compile and deploy our application. Obtain the code from GitHub:
git clone https://github.com/GoogleCloudPlatform/cpp-samples
# Output: Cloning into 'cpp-samples'...
# additional informational messages
Change your working directory to the code location:
cd cpp-samples/getting-started
# Output: none
Compile the code into a Docker image. Since we are only planning to build this example once, we will use Cloud Build. Using Cloud Build is simpler, but it does not create a cache of the intermediate build artifacts. Read about buildpacks and the pack tool install guide to run your builds locally and cache intermediate artifacts. You can also use Container Registry as a shared cache for buildpacks, both between workstations and for your CI systems. To learn more about this, consult the buildpack documentation for cache images.
You can continue with other steps while this build runs in the background. Optionally, use the links in the output to follow the build process in your web browser.
gcloud builds submit \
--async \
--machine-type=e2-highcpu-32 \
--pack image="gcr.io/$GOOGLE_CLOUD_PROJECT/getting-started-cpp/index-gcs-prefix"
# Output:
# Creating temporary tarball archive of 10 file(s) totalling 58.1 KiB before compression.
# Uploading tarball of [.] to [gs://....tgz]
# Created [https://cloudbuild.googleapis.com/v1/projects/....].
# Logs are available at [...].
As mentioned above, this guide uses Cloud Spanner to store the data. We start with the smallest possible instance (100 processing units), which is economical and sufficient for small jobs. We will scale it up later if needed.
⚠️ Creating the Cloud Spanner instance incurs immediate billing costs, even if the instance is not used.
gcloud beta spanner instances create getting-started-cpp \
--config=regional-us-central1 \
--processing-units=100 \
--description="Getting Started with C++"
# Output: Creating instance...done.
A Cloud Spanner instance is just an allocation of compute resources for your databases. Think of it as a virtual set of database servers. Initially these servers have no databases or tables associated with them. We need to create a database and table that will host the data for this guide:
gcloud spanner databases create gcs-index \
--ddl-file=gcs_objects.sql \
--instance=getting-started-cpp
# Output: Creating database...done.
Publishers send messages to Cloud Pub/Sub using a topic. These are named, persistent resources. We need to create one to configure the application.
gcloud pubsub topics create gcs-indexing-requests
# Output: Created topic [projects/..../topics/gcs-indexing-requests].
Cloud Pub/Sub needs permission to create authentication tokens in your project so that it can make authenticated push requests. Look up the project number and grant the `roles/iam.serviceAccountTokenCreator` role to the Cloud Pub/Sub service account:
PROJECT_NUMBER=$(gcloud projects list \
--filter="project_id=$GOOGLE_CLOUD_PROJECT" \
--format="value(project_number)" \
--limit=1)
# Output: none
gcloud projects add-iam-policy-binding "$GOOGLE_CLOUD_PROJECT" \
--member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com \
--role=roles/iam.serviceAccountTokenCreator
# Output: Updated IAM policy for project [$GOOGLE_CLOUD_PROJECT].
# bindings:
# ... full list of bindings for your project's IAM policy ...
Look at the status of your build using:
gcloud builds list --ongoing
# Output: the list of running jobs
If your build has completed, the list will be empty. If you need to wait for this build to complete (it should take about 15 minutes), use:
gcloud builds log --stream $(gcloud builds list --ongoing --format="value(id)")
# Output: the output from the build, streamed.
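If you prefer to block a script until the build finishes instead of streaming the log, a polling loop over the same `gcloud builds list --ongoing` command is one option. This is a sketch under the assumption that no other builds are running in the project, since it waits for all ongoing builds. The name `wait_for_builds` is hypothetical.

```shell
# Hypothetical helper: poll until no builds are running in the current project.
# $1 is the polling interval in seconds (defaults to 30).
wait_for_builds() {
  interval="${1:-30}"
  while [ -n "$(gcloud builds list --ongoing --format='value(id)')" ]; do
    sleep "$interval"
  done
}
```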
⚠️ To continue, you must wait until the Cloud Build build has completed.
Once the image is uploaded, we can create a Cloud Run deployment to run it. This starts up an instance of the job. Cloud Run will scale this up or down as needed:
gcloud run deploy index-gcs-prefix \
--image="gcr.io/$GOOGLE_CLOUD_PROJECT/getting-started-cpp/index-gcs-prefix:latest" \
--set-env-vars="SPANNER_INSTANCE=getting-started-cpp,SPANNER_DATABASE=gcs-index,TOPIC_ID=gcs-indexing-requests,GOOGLE_CLOUD_PROJECT=$GOOGLE_CLOUD_PROJECT" \
--region="us-central1" \
--platform="managed" \
--no-allow-unauthenticated
# Output: Deploying container to Cloud Run service [index-gcs-prefix] in project [....] region [us-central1]
# Service [index-gcs-prefix] revision [index-gcs-prefix-00001-yeg] has been deployed and is serving 100 percent of traffic.
# Service URL: https://index-gcs-prefix-...run.app
We need the URL of this deployment to finish the Cloud Pub/Sub configuration:
URL="$(gcloud run services describe index-gcs-prefix \
--region="us-central1" --format="value(status.url)")"
# Output: none
Create a push subscription. This sends Cloud Pub/Sub messages as HTTP requests to the Cloud Run deployment. We use the previously created service account to make the HTTP request, and allow up to 10 minutes for the request to complete before Cloud Pub/Sub retries on a different instance.
gcloud pubsub subscriptions create indexing-requests-cloud-run-push \
--topic="gcs-indexing-requests" \
--push-endpoint="$URL" \
--push-auth-service-account="$PROJECT_NUMBER[email protected]" \
--ack-deadline=600
# Output: Created subscription [projects/$GOOGLE_CLOUD_PROJECT/subscriptions/indexing-requests-cloud-run-push].
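For reference, a push subscription delivers each message as the JSON body of an HTTP POST request, with the message attributes nested under a `message` field. The sketch below prints a document of that general shape for a given bucket and prefix. The helper name, the `messageId` value, and the project name are placeholders, and a real request carries additional fields such as the publish time.

```shell
# Hypothetical helper: print a JSON document shaped like the body of a
# Cloud Pub/Sub push request. $1 is the bucket, $2 is the prefix.
# The messageId and the project name are placeholders, not real values.
example_push_body() {
  cat <<EOF
{
  "message": {
    "attributes": {"bucket": "$1", "prefix": "$2"},
    "messageId": "1234567890"
  },
  "subscription": "projects/example-project/subscriptions/indexing-requests-cloud-run-push"
}
EOF
}
```

The service reads the bucket and prefix for each request from attributes like these, which is why the indexing requests in this guide are published with `--attribute` rather than with a message body.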
The following command requests indexing some public data. The prefix contains fewer than 100 objects:
gcloud pubsub topics publish gcs-indexing-requests \
--attribute=bucket=gcp-public-data-landsat,prefix=LC08/01/006/001
# Output: messageIds:
# - '....'
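When sending more than one request, it can be convenient to wrap the publish command in a small function. A sketch, assuming the topic name used in this guide; `request_indexing` is a hypothetical name:

```shell
# Hypothetical helper: ask the indexing service to scan the prefix $2
# in the bucket $1 by publishing a message with those two attributes.
request_indexing() {
  gcloud pubsub topics publish gcs-indexing-requests \
    --attribute="bucket=$1,prefix=$2"
}
```

With this in place, `request_indexing gcp-public-data-landsat LC08/01/006/001` is equivalent to the command above.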
The data should start appearing in the Cloud Spanner database. We can use the `gcloud` tool to query this data.
gcloud spanner databases execute-sql gcs-index --instance=getting-started-cpp \
--sql="select * from gcs_objects where name like '%.txt' order by size desc limit 10"
# Output: metadata for the 10 largest objects with names ending in `.txt`
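The same approach covers the kind of lookup that motivated this guide, such as finding objects within a size range. The sketch below assumes only the `name` and `size` columns used in the query above and that `size` is a numeric column. The function name `find_objects_by_size` is hypothetical.

```shell
# Hypothetical helper: list objects whose size in bytes is between $1 and $2.
# Both bounds must be non-negative integers; anything else is rejected
# before it reaches the SQL statement.
find_objects_by_size() {
  for bound in "$1" "$2"; do
    case "$bound" in
      ''|*[!0-9]*) echo "usage: find_objects_by_size MIN MAX" >&2; return 1 ;;
    esac
  done
  gcloud spanner databases execute-sql gcs-index --instance=getting-started-cpp \
    --sql="select name, size from gcs_objects where size between $1 and $2 order by size desc"
}
```

For example, `find_objects_by_size 1000000 2000000` would list the objects between roughly 1 MB and 2 MB, largest first.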
⚠️ The following steps will incur significant billing costs. Use the Pricing Calculator to estimate the costs. If you are uncertain as to these costs, skip to the Cleanup Section.
To scan a larger prefix we will need to scale up the Cloud Spanner instance. We use a `gcloud` command for this:
gcloud beta spanner instances update getting-started-cpp --processing-units=3000
# Output: Updating instance...done.
Then index a larger prefix, which contains many more objects:
gcloud pubsub topics publish gcs-indexing-requests \
--attribute=bucket=gcp-public-data-landsat,prefix=LC08/01/006
# Output: messageIds:
# - '....'
You can monitor the work queue using the console:
google-chrome "https://console.cloud.google.com/cloudpubsub/subscription/detail/indexing-requests-cloud-run-push?project=$GOOGLE_CLOUD_PROJECT"
Or count the number of indexed objects:
gcloud spanner databases execute-sql gcs-index --instance=getting-started-cpp \
--sql="select count(*) from gcs_objects"
# Output:
# (Unspecified) --> the count(*) column name
# 225446 --> the number of rows in the `gcs_objects` table (the actual number may be different)
It is also interesting to see the number of instances for the job:
google-chrome "https://console.cloud.google.com/run/detail/us-central1/index-gcs-prefix/metrics?project=$GOOGLE_CLOUD_PROJECT"
As next steps, consider:
- Automatically updating the index as the bucket changes.
- Learning how to deploy similar code to GKE.
⚠️ Do not forget to clean up your billable resources after going through this "Getting Started" guide.
gcloud spanner databases delete gcs-index --instance=getting-started-cpp --quiet
# Output: none
gcloud spanner instances delete getting-started-cpp --quiet
# Output: none
gcloud run services delete index-gcs-prefix \
--region="us-central1" \
--platform="managed" \
--quiet
# Output:
# Deleting [index-gcs-prefix]...done.
# Deleted [index-gcs-prefix].
gcloud pubsub subscriptions delete indexing-requests-cloud-run-push --quiet
# Output: Deleted subscription [projects/$GOOGLE_CLOUD_PROJECT/subscriptions/indexing-requests-cloud-run-push].
gcloud pubsub topics delete gcs-indexing-requests --quiet
# Output: Deleted topic [projects/$GOOGLE_CLOUD_PROJECT/topics/gcs-indexing-requests].
gcloud container images delete gcr.io/$GOOGLE_CLOUD_PROJECT/getting-started-cpp/index-gcs-prefix:latest --quiet
# Output: Deleted [gcr.io/$GOOGLE_CLOUD_PROJECT/getting-started-cpp/index-gcs-prefix:latest]
# Output: Deleted [gcr.io/$GOOGLE_CLOUD_PROJECT/getting-started-cpp/index-gcs-prefix@sha256:....]