
Oracle to offer ‘world’s first’ zettascale AI cluster via its public cloud

Oracle Corp. plans to equip its public cloud with what it describes as the world’s first zettascale computing cluster.

The cluster, which the company previewed today at its Oracle CloudWorld conference, will provide up to 2.4 zettaflops of performance for artificial intelligence workloads. One zettaflop equals one sextillion, or 10²¹, computing operations per second. The speed of the world’s fastest supercomputers is typically measured in exaflops, units of compute three orders of magnitude smaller than a zettaflop.

Under the hood, the AI cluster is based on Nvidia Corp.’s flagship Blackwell B200 graphics processing unit. The cluster can reach its 2.4-zettaflop top speed when customers provision it with 131,072 B200 chips, the maximum GPU count that Oracle plans to support. That’s more than three times the number of graphics cards in the world’s fastest supercomputer, a system called Frontier that the U.S. Energy Department uses for scientific research. 
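The reported figures can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not an Oracle-published breakdown; the implied per-GPU number is consistent with a low-precision peak throughput, suggesting the headline zettaflop figure is quoted at reduced precision.

```python
# Back-of-the-envelope check of the cluster figures reported above.
# 1 zettaflop = 1e21 ops/s; 1 exaflop = 1e18; 1 petaflop = 1e15.
ZETTA, EXA, PETA = 1e21, 1e18, 1e15

cluster_flops = 2.4 * ZETTA   # Oracle's stated peak
gpu_count = 131_072           # maximum B200 count

# The same peak expressed in exaflops (three orders of magnitude smaller).
print(cluster_flops / EXA)    # 2400.0

# Per-GPU throughput implied by the headline number.
per_gpu = cluster_flops / gpu_count
print(per_gpu / PETA)         # ≈ 18.3 petaflops per B200
```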

The B200 comprises two separate compute modules, or dies, that are made using a four-nanometer manufacturing process. They’re linked together by an interconnect that can transfer up to 10 terabytes of data between them every second, and together the chip contains 208 billion transistors. The B200 also features 192 gigabytes of HBM3e, a type of high-speed memory.

One of the chip’s flagship features is a so-called microscaling capability. AI models process information in the form of floating point numbers, units of data that contain four to 32 bits’ worth of information. The smaller the unit of data, the less time it takes to process. The B200’s microscaling capability can compress some floating point numbers into smaller ones, thereby speeding up calculations.
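The idea behind microscaling formats can be illustrated with a rough NumPy sketch. This is an illustration of block-wise scaled quantization in general, not Nvidia’s actual hardware implementation: a group of values shares one scale factor, and each value is then stored as a small integer.

```python
import numpy as np

def quantize_block(block, bits=8):
    """Quantize a block of floats to low precision with one shared scale,
    mimicking the block-scaling idea behind microscaling formats."""
    levels = 2 ** (bits - 1) - 1                  # symmetric signed range
    scale = max(np.max(np.abs(block)) / levels, 1e-12)
    q = np.round(block / scale).astype(np.int32)  # small integers, fast to process
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate floats from the compact representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=32).astype(np.float32)  # one block of model values
q, s = quantize_block(x)
x_hat = dequantize_block(q, s)
print(np.max(np.abs(x - x_hat)))  # small reconstruction error
```

Shrinking each value this way trades a little accuracy for much less data to move and multiply, which is where the speedup comes from.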

Oracle’s B200-powered AI cluster will support two networking protocols: InfiniBand and RoCEv2, an enhanced version of Ethernet. Both technologies include so-called kernel bypass features, which allow network traffic to skip some of the components that it must usually go through to reach its destination. This arrangement enables data to reach GPUs faster and thereby speeds up processing.

The B200 cluster will become available in the first quarter of 2025. Around the same time, Oracle plans to expand its public cloud with another new infrastructure option based on Nvidia’s GB200 NVL72 system. It’s a liquid-cooled compute appliance that ships with 36 GB200 accelerators, each of which includes two B200 graphics cards and one central processing unit.

The GB200 supports an Nvidia networking technology called SHARP. AI chips must regularly exchange data with one another over the network that links them to coordinate their work. Moving data in this manner consumes some of the chips’ processing power. SHARP reduces the amount of information that has to be sent over the network, which reduces the associated processing requirements and leaves more GPU capacity for AI workloads. 

“We include supercluster monitoring and management APIs to enable you to quickly query for the status of each node in the cluster, understand performance and health, and allocate nodes to different workloads, greatly enhancing availability,” Mahesh Thiagarajan, executive vice president of Oracle Cloud Infrastructure, detailed in a blog post.

Oracle is also enhancing its cloud platform’s support for other chips in Nvidia’s product portfolio. Later this year, the company plans to add a new cloud cluster based on the H200, the chip that headlined Nvidia’s data center GPU lineup until the debut of the B200 in March. The cluster will allow users to provision up to 65,536 H200 chips for 260 exaflops, or just over a quarter zettaflop, of performance.
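The H200 cluster’s numbers check out the same way. This is again back-of-the-envelope arithmetic rather than Oracle-published math; the implied per-chip figure is consistent with a low-precision peak for the H200.

```python
EXA, PETA = 1e18, 1e15
h200_count = 65_536
cluster = 260 * EXA  # Oracle's stated aggregate performance

# Per-chip throughput implied by the cluster figure.
print(cluster / h200_count / PETA)  # ≈ 3.97 petaflops per H200

# A quarter zettaflop is 250 exaflops, so 260 is indeed "just over" it.
print(cluster / (0.25 * 1e21))      # 1.04
```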

Thiagarajan detailed that Oracle is also upgrading its cloud’s storage infrastructure to accommodate the new AI clusters. Large-scale neural networks require the ability to quickly move data to and from storage while running calculations. 

“We will also soon introduce a fully managed Lustre file service that can support dozens of terabits per second,” Thiagarajan wrote. “To match the increased storage throughput, we’re increasing the OCI GPU Compute frontend network capacity from 100 Gbps in the H100 GPU-accelerated instances to 200 Gbps with H200 GPU-accelerated instances, and 400 Gbps per instance for B200 GPU and GB200 instances.”

Photo: Oracle

Source: siliconangle.com
