Jump to Content
Application Development

Run your AI inference applications on Cloud Run with NVIDIA GPUs

August 21, 2024
Sagar Randive

Product Manager, Google Cloud Serverless

Wenlei (Frank) He

Senior Staff Software Engineer, Google Cloud Serverless

Join us at Google Cloud Next

Early bird pricing available now through Feb 14th.

Register

Developers love Cloud Run for its simplicity, fast autoscaling, scale-to-zero capabilities, and pay-per-use pricing. Those same benefits come into play for real-time inference apps serving open gen AI models. That's why today, we’re adding support for NVIDIA L4 GPUs to Cloud Run, in preview.

This opens the door to many new use cases to Cloud Run developers:

  • Performing real-time inference with lightweight open models such as Google’s open Gemma (2B/7B) models or Meta’s Llama 3 (8B) to build custom chat bots or on-the-fly document summarization, while scaling to handle spiky user traffic. 

  • Serving custom fine-tuned gen AI models, such as image generation tailored to your company's brand, and scaling down to optimize costs when nobody's using them.

  • Speeding up your compute-intensive Cloud Run services, such as on-demand image recognition, video transcoding and streaming, and 3D rendering.

As a fully managed platform, Cloud Run lets you run your code directly on top of Google’s scalable infrastructure, combining the flexibility of containers with the simplicity of serverless to help boost your productivity. With Cloud Run, you can run frontend and backend services, batch jobs, deploy websites and applications, and handle queue processing workloads — all without having to manage the underlying infrastructure.

At the same time, many workloads that perform AI inference, especially applications that demand real-time processing, require GPU acceleration to deliver responsive user experiences. With support for NVIDIA GPUs, you can perform on-demand online AI inference using the LLMs of your choice in seconds. With 24GB of vRAM, you can expect fast token rates for models with up to 9 billion parameters, including Llama 3.1(8B), Mistral (7B), Gemma 2 (9B). When your app is not in use, the service automatically scales down to zero so that you are not charged for it.

“With the addition of NVIDIA L4 Tensor GPU and NVIDIA NIM support, Cloud Run provides users a real-time, fast-scaling AI inference platform to help customers accelerate their AI projects and get their solutions to market faster — with minimal infrastructure management overhead.” - Anne Hecht, Senior Director of Product Marketing, NVIDIA

Early customers are excited about the combination of Cloud Run and NVIDIA GPUs.

“Cloud Run's GPU support has been a game-changer for our real-time inference applications. The low cold-start latency is impressive, allowing our models to serve predictions almost instantly, which is critical for time-sensitive customer experiences. Additionally, Cloud Run GPUs maintain consistently minimal serving latency under varying loads, ensuring our generative AI applications are always responsive and dependable — all while effortlessly scaling to zero during periods of inactivity. Overall, Cloud Run GPUs have significantly enhanced our ability to provide fast, accurate, and efficient results to our end users.” - Thomas MENARD, Head of AI - Global Beauty Tech, L’Oreal

“Cloud Run GPUs are hands-down the best way to consume GPU compute on Google Cloud. I love how it provides a high degree of control and customizability using open-source standards (Knative) as well as great observability tools out of the box, together with fully managed infrastructure that scales to zero. And since we can easily migrate to GKE using Knative primitives, there is always an option to get even more control at the cost of higher complexity and maintenance. GPU allocation and startup times were also faster for our use-case compared to most competing services.” - Alex Bielski, Director of Innovation, Chaptr

Using NVIDIA GPUs on Cloud Run

Today, we support attaching one NVIDIA L4 GPU per Cloud Run instance, and you do not need to reserve your GPUs in advance. To start, Cloud Run GPUs are available today in us-central1(Iowa), with availability in europe-west4 (Netherlands) and asia-southeast1 (Singapore) expected before the end of the year. 

To deploy a Cloud Run service with NVIDIA GPUs, add the --gpu=1 flag to specify the number of GPUs and --gpu-type=nvidia-l4 flag to specify the type of GPU in the command line. Or, you can do this from the Google Cloud console:

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/GPU_blog_gif_2.gif

And with the recently announced Cloud Run functions, you can also attach a GPU to your functions to perform event-driven AI inference with simplicity.

"The newly released Cloud Run functions with GPU support enables Python developers to use Hugging Face models without having to worry about infrastructure, GPU drivers or containers. Cloud Run's scales to zero and fast startup capabilities are a great match for developers looking at getting started with AI using HuggingFace models with just a few lines of serverless code” - Julien Chaumond, CTO, Hugging Face

Performance

Along with simple operations, Cloud Run with NVIDIA GPUs also offers strong performance. We keep our infrastructure latency to a minimum so that you can get the best performance when serving your models. 

Cloud Run instances with an attached L4 GPU with driver pre-installed start in approximately 5 seconds, at which point the processes running in your container can start to use the GPU. Then, you’ll need another few seconds for the framework and model to load and initialize. The table below shows cold-start times for Gemma 2b, Gemma2 9b, Llama2 7b/13b, and Llama3.1 8b models with the Ollama framework, ranging from 11 to 35 seconds. This measures the time to start an instance from 0, load the model in the GPU, and for the LLM to return its first word.

Model

Model Size 

Cold Start Time

gemma:2b

1.7 GB

11-17 seconds

gemma2:9b

5.1 GB

25-30 seconds

llama2:7b

3.8 GB

14-21 seconds

llama2:13b

7.4 GB

23-35 seconds

llama3.1:8b

4.7 GB

15-21 seconds

Cold start time: Time taken for first invocation to the service URL for Cloud Run instance to go from 0-1 and serve the first word of the response.
Models: we used 4 bit quantized versions of each of the models above. These models were deployed using the Ollama framework. 
Note that these numbers are observed in a controlled lab environment and actual performance numbers may vary depending on a variety of factors. “

Deploy a sample app using Ollama

https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_lL36B9K.max-300x300.jpg

Below, you can see how to deploy Google’s Gemma2 9b model with Ollama using Cloud Run with NVIDIA GPUs. Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Ollama is a framework that provides a simple API to manage large language models. 

First, create a container image with Ollama and the model with this Dockerfile:

Loading...

Then deploy using the following command:

Loading...

And that’s it! Once deployed, you can use the Ollama API to start chatting with Gemma 2!

“Deploying a Large Language Model using Ollama on Cloud Run is remarkably straightforward, thanks to the latest GPU support. With just a few commands, you can leverage Ollama’s seamless integration with your app and Cloud Run’s serverless infrastructure to deploy, and manage your LLMs effortlessly. The fast coldstarts and rapid scaling of Cloud Run let you scale your application reliably. No deep knowledge of infrastructure or machine learning is required — simply focus on your application and let the tools handle the rest.” - Jeffrey Morgan, Founder, Ollama

Additionally, you can also leverage NVIDIA NIM inference microservices, part of the NVIDIA AI Enterprise software suite available in the Google Cloud Marketplace. This provides secure, reliable deployment of high-performance AI model inferencing accelerated to simplify AI inference deployments and maximize performance on NVIDIA L4 GPUs on Cloud Run. Check out this NVIDIA blog to learn how to get started.

Get started today

Cloud Run makes it super easy to host your web applications. And now with GPU support, we are extending the best of serverless, simplicity and scalability to your AI inference applications too! To start using Cloud Run with NVIDIA GPUs, sign up at g.co/cloudrun/gpu to join our preview program today and wait for our welcome email.

To learn more about Cloud Run with GPUs, join this livestream on August 21, 2024 with NVIDIA and Ollama. We will discuss new features for Cloud Run and demo how to use Cloud Run in different scenarios.

Posted in