Google Cloud Run speeds up on-demand AI inference with Nvidia’s L4 GPUs

Google Cloud is giving developers an easier way to get their artificial intelligence applications up and running in the cloud, with the addition of graphics processing unit support on the Google Cloud Run serverless platform.

The company said in a blog post today that it’s adding support for Nvidia’s L4 graphics processing units on Google Cloud Run in preview in a limited number of regions, ahead of a wider rollout in future.

First unveiled in 2019, Google Cloud Run is a fully managed, serverless computing platform that makes it easy for developers to launch applications, websites and online workflows. With Cloud Run, developers simply upload their code as a stateless container into a serverless environment, so there’s no need to worry about infrastructure management.

It differs from other cloud computing platforms because everything is fully managed. Though some developers appreciate the cloud because it provides the ability to fine-tune the way their computing environments are configured, not everyone wants to bother with this.

Cloud Run does all of the heavy lifting for developers, so they don’t have to ponder their compute and storage requirements or worry about configurations and provisioning. Thanks to its pay-per-use pricing model, it also eliminates the risk of overprovisioning and paying for more computing resources than developers actually use, and it naturally requires fewer people to get a new application or website up and running.

On-demand AI inference

In the blog post, Google Cloud Serverless Product Manager Sagar Randive said his team realized that Cloud Run’s benefits make it an ideal option for running real-time AI inference applications that serve generative AI models, which is why the company is introducing support for Nvidia’s L4 GPUs.

With support for Nvidia’s GPUs, Cloud Run users can perform on-demand online AI inference using any large language model they want, in a matter of seconds.

“With 24GB of vRAM, you can expect fast token rates for models with up to 9 billion parameters, including Llama 3.1 (8B), Mistral (7B), Gemma 2 (9B),” Randive said. “When your app is not in use, the service automatically scales down to zero so that you are not charged for it.”
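To see why 9 billion parameters is a plausible ceiling for a 24GB card, a rough calculation helps. The sketch below is not from Google's post: it assumes the weights dominate memory use, uses the nominal parameter counts quoted above rather than exact figures, and assumes common storage precisions (4, 2 and 1 bytes per parameter). The function names are hypothetical, chosen for illustration.

```python
# Back-of-envelope check: do a model's weights fit in one L4's 24 GB?
# Parameter counts are the nominal sizes quoted above, not exact figures.

L4_VRAM_GB = 24  # per-GPU memory cited in the article

# Bytes per parameter for common precisions (assumed, not from Google's post).
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

NOMINAL_PARAMS_BILLION = {
    "Llama 3.1 (8B)": 8,
    "Mistral (7B)": 7,
    "Gemma 2 (9B)": 9,
}


def weight_memory_gb(params_billion, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 10**9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


def fits_on_l4(params_billion, bytes_per_param):
    """True if the weights alone are smaller than the L4's memory."""
    return weight_memory_gb(params_billion, bytes_per_param) < L4_VRAM_GB


if __name__ == "__main__":
    for name, size in NOMINAL_PARAMS_BILLION.items():
        for precision, nbytes in BYTES_PER_PARAM.items():
            gb = weight_memory_gb(size, nbytes)
            verdict = "fits" if fits_on_l4(size, nbytes) else "does not fit"
            print(f"{name} at {precision}: ~{gb} GB of weights, {verdict}")
```

Under these assumptions, a 9 billion-parameter model stored at 16-bit precision needs roughly 18GB for its weights, leaving about 6GB for the key-value cache and runtime overhead, while the same model at 32-bit precision would need about 36GB and would not fit on a single L4.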

The company believes that GPU support makes Cloud Run a more viable option for various AI workloads, including inference tasks with lightweight LLMs such as Gemma 2B, Gemma 7B or Llama-3 8B. In turn, this paves the way for developers to build and launch customized chatbots or AI summarization models that can scale to handle spikes in traffic.

Other use cases include serving customized and fine-tuned generative AI models, such as a scalable and cost-effective image generator that’s tailored for a company’s brand. In addition, Cloud Run GPUs support other compute-intensive tasks such as on-demand image recognition, video transcoding, streaming and 3D rendering, Google said.

Nvidia’s L4 GPUs are available in preview on Google Cloud Run now in the us-central1 (Iowa) region, and will launch in europe-west4 (Netherlands) and asia-southeast1 (Singapore) by the end of the year. The service supports a single L4 GPU per instance, and there’s no need to reserve the GPU in advance, Google said.

A handful of customers have already been lucky enough to pilot the new offering, including the cosmetics and beauty products giant L’Oréal S.A., which is using GPUs on Cloud Run to power a number of its real-time inference applications.

“The low cold-start latency is impressive, allowing our models to serve predictions almost instantly, which is critical for time-sensitive customer experiences,” said Thomas Menard, head of AI at L’Oréal. “Cloud Run GPUs maintain consistently minimal serving latency under varying loads, ensuring our generative AI applications are always responsive and dependable.”
