
Oracle to offer ‘world’s first’ zettascale AI cluster via its public cloud

Oracle Corp. plans to equip its public cloud with what it describes as the world’s first zettascale computing cluster.

The cluster, which the company previewed today at its Oracle CloudWorld conference, will provide up to 2.4 zettaflops of performance for artificial intelligence workloads. One zettaflop equals one sextillion, or 10²¹, computing operations per second. The speed of the world’s fastest supercomputers is typically measured in exaflops, units of compute three orders of magnitude smaller than a zettaflop.

Under the hood, the AI cluster is based on Nvidia Corp.’s flagship Blackwell B200 graphics processing unit. The cluster can reach its 2.4-zettaflop top speed when customers provision it with 131,072 B200 chips, the maximum GPU count that Oracle plans to support. That’s more than three times the number of graphics cards in the world’s fastest supercomputer, a system called Frontier that the U.S. Energy Department uses for scientific research. 
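The reported figures can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not an Oracle-published breakdown; the implied per-GPU number is consistent with a low-precision peak throughput, suggesting the headline zettaflop figure is quoted at reduced precision.

```python
# Back-of-the-envelope check of the cluster figures reported above.
# 1 zettaflop = 1e21 ops/s; 1 exaflop = 1e18; 1 petaflop = 1e15.
ZETTA, EXA, PETA = 1e21, 1e18, 1e15

cluster_flops = 2.4 * ZETTA   # Oracle's stated peak
gpu_count = 131_072           # maximum B200 count

# The same peak expressed in exaflops (three orders of magnitude smaller).
print(cluster_flops / EXA)    # 2400.0

# Per-GPU throughput implied by the headline number.
per_gpu = cluster_flops / gpu_count
print(per_gpu / PETA)         # ≈ 18.3 petaflops per B200
```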

The B200 comprises two separate compute modules, or dies, that are made using a four-nanometer manufacturing process. They’re linked together by an interconnect that can transfer up to 10 terabytes of data between them every second, and together the chip contains 208 billion transistors. The B200 also features 192 gigabytes of HBM3e, a type of high-speed memory.

One of the chip’s flagship features is a so-called microscaling capability. AI models process information in the form of floating point numbers, units of data that contain four to 32 bits’ worth of information. The smaller the unit of data, the less time it takes to process. The B200’s microscaling capability can compress some floating point numbers into smaller ones, thereby speeding up calculations.
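The idea behind microscaling formats can be illustrated with a rough NumPy sketch. This is an illustration of block-wise scaled quantization in general, not Nvidia’s actual hardware implementation: a group of values shares one scale factor, and each value is then stored as a small integer.

```python
import numpy as np

def quantize_block(block, bits=8):
    """Quantize a block of floats to low precision with one shared scale,
    mimicking the block-scaling idea behind microscaling formats."""
    levels = 2 ** (bits - 1) - 1                  # symmetric signed range
    scale = max(np.max(np.abs(block)) / levels, 1e-12)
    q = np.round(block / scale).astype(np.int32)  # small integers, fast to process
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate floats from the compact representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=32).astype(np.float32)  # one block of model values
q, s = quantize_block(x)
x_hat = dequantize_block(q, s)
print(np.max(np.abs(x - x_hat)))  # small reconstruction error
```

Shrinking each value this way trades a little accuracy for much less data to move and multiply, which is where the speedup comes from.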

Oracle’s B200-powered AI cluster will support two networking protocols: InfiniBand and RoCEv2, an enhanced version of Ethernet. Both technologies include so-called kernel bypass features, which allow network traffic to skip some of the components that it must usually go through to reach its destination. This arrangement enables data to reach GPUs faster and thereby speeds up processing.

The B200 cluster will become available in the first quarter of 2025. Around the same time, Oracle plans to expand its public cloud with another new infrastructure option based on Nvidia’s GB200 NVL72 system. It’s a liquid-cooled compute appliance that ships with 36 GB200 accelerators, each of which includes two B200 graphics cards and one central processing unit.

The GB200 supports an Nvidia networking technology called SHARP. AI chips must regularly exchange data with one another over the network that links them to coordinate their work. Moving data in this manner consumes some of the chips’ processing power. SHARP reduces the amount of information that has to be sent over the network, which reduces the associated processing requirements and leaves more GPU capacity for AI workloads. 

“We include supercluster monitoring and management APIs to enable you to quickly query for the status of each node in the cluster, understand performance and health, and allocate nodes to different workloads, greatly enhancing availability,” Mahesh Thiagarajan, executive vice president of Oracle Cloud Infrastructure, detailed in a blog post.

Oracle is also enhancing its cloud platform’s support for other chips in Nvidia’s product portfolio. Later this year, the company plans to add a new cloud cluster based on the H200, the chip that headlined Nvidia’s data center GPU lineup until the debut of the B200 in March. The cluster will allow users to provision up to 65,536 H200 chips for 260 exaflops, or just over a quarter zettaflop, of performance.
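The H200 cluster’s numbers check out the same way. This is again back-of-the-envelope arithmetic rather than Oracle-published math; the implied per-chip figure is consistent with a low-precision peak for the H200.

```python
EXA, PETA = 1e18, 1e15
h200_count = 65_536
cluster = 260 * EXA  # Oracle's stated aggregate performance

# Per-chip throughput implied by the cluster figure.
print(cluster / h200_count / PETA)  # ≈ 3.97 petaflops per H200

# A quarter zettaflop is 250 exaflops, so 260 is indeed "just over" it.
print(cluster / (0.25 * 1e21))      # 1.04
```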

Thiagarajan detailed that Oracle is also upgrading its cloud’s storage infrastructure to accommodate the new AI clusters. Large-scale neural networks require the ability to quickly move data to and from storage while running calculations. 

“We will also soon introduce a fully managed Lustre file service that can support dozens of terabits per second,” Thiagarajan wrote. “To match the increased storage throughput, we’re increasing the OCI GPU Compute frontend network capacity from 100 Gbps in the H100 GPU-accelerated instances to 200 Gbps with H200 GPU-accelerated instances, and 400 Gbps per instance for B200 GPU and GB200 instances.”

Photo: Oracle

Source: siliconangle.com
