
Nvidia admits Blackwell defect, but says it's fine now

Nvidia has confirmed earlier reports that its Blackwell generation of GPUs suffered from a design defect that adversely impacted the yields of the hotly anticipated accelerators.

"We executed a change to the Blackwell GPU mask to improve production yields," Nvidia CFO Colette Kress acknowledged on the companuy’s Wednesday's earnings call.

Neither Kress nor CEO Jensen Huang explained the nature of the defect beyond the mention of a mask. However, whatever was wrong with the manufacturing process apparently won't stop Nvidia from shipping Blackwell in Q4 as previously promised.

"The change to the mask is complete. There were no functional changes necessary. And so, we're sampling functional samples of Blackwell — Grace Blackwell in a variety of system configurations as we speak," Huang told investors.

Even if Nvidia manages to ship Blackwell on time, the change could simply mean a later ramp and revenue realization than previously planned. Even so, Kress insisted that "Blackwell production ramp is scheduled to begin in the fourth quarter and continue into fiscal year '26," and that in Q4, the GPU giant will "ship several billion dollars in Blackwell revenue."

Huang doubled down in Q&A: "When I said ship production in Q4, I meant shipping out. I don't mean starting to ship … I don't mean starting production."

While Huang and Kress attempted to play off the mask swap as a minor matter, Gartner analyst Gaurav Gupta told The Register that design changes like these can be extremely costly, especially this late in production.

"Typically, you want these issues to be resolved early on," he said.

Gupta noted that even if Blackwell is late, Nvidia has two advantages: the first is that, despite recent gains by AMD, Intel, and others, it still doesn't face much competition. The second is that the mask change could end up saving Nvidia a considerable amount of cash in the long run.

"If the new mask improves production yield, they will easily recover any loss due to delays. In chip fabrication, achieving high yields is critical, so less fabricated chips are discarded and production is more efficient and reliable, and it also helps improve cycle time."

Nvidia's disclosure mirrors statements made by Foxconn execs earlier this month that Grace-Blackwell-based products would begin shipping in small volumes in the fourth quarter.

"We are on track to develop and prepare the manufacturing of the new AI server to start shipping in small volumes in the last quarter of 2024, and increase the production volume in the first quarter of next year," Foxconn spokesperson James Wu previously said.

Announced at Nvidia’s GTC conference in March 2024, the Blackwell generation of GPUs boasts more than twice the VRAM of the H100 and 2.5-5x higher performance. Achieving this performance uplift requires some major changes in design philosophy, including the move to a multi-die configuration – an approach similar to that used by AMD and Intel in their latest generation of accelerators.

Making matters more complicated is that Nvidia is pushing its Grace-Blackwell Superchips, AKA the GB200, much harder this generation. The 2,700W parts pair two Blackwell GPUs with a single 72-core Grace CPU. Thirty-six of these superchips form its 120kW NVL72 rack systems, which Nvidia claims offer a 30x improvement in inference performance thanks to the speedy NVLink switch fabric tying everything together. We took a deep dive on that system back at GTC if you're curious.
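
Those rack-level figures roughly hang together, as the back-of-the-envelope check below shows. The superchip count, per-part wattage, and 120 kW envelope come from the numbers above; the split of the remainder between NVLink switch trays, NICs, and cooling is our assumption.

# Back-of-the-envelope check of the NVL72 numbers quoted above.
# Superchip count, per-part wattage, and rack budget are as reported;
# what the leftover "overhead" covers is an assumption.
superchips_per_rack = 36
watts_per_superchip = 2_700        # GB200: two Blackwell GPUs plus one Grace CPU
rack_budget_watts = 120_000        # Nvidia's quoted NVL72 envelope

compute_watts = superchips_per_rack * watts_per_superchip
overhead_watts = rack_budget_watts - compute_watts   # switches, NICs, fans, etc.

print(f"GPUs per rack:    {superchips_per_rack * 2}")
print(f"Superchip power:  {compute_watts / 1000:.1f} kW")
print(f"Implied overhead: {overhead_watts / 1000:.1f} kW of the 120 kW budget")

That leaves a bit under 23 kW for the NVLink switch fabric and everything else, assuming the quoted figures are nameplate numbers.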

However, it appears something about the design wasn't quite right, necessitating the mask change. As we previously reported, Nvidia had supposedly warned Microsoft that shipments of the chips had been delayed due to a problem with the packaging tech used to stitch Blackwell's two dies together.

Nvidia was therefore said to be prioritizing its flagship GB200 parts over the lower-spec HGX B100 and B200 configurations, and would bring a trimmed-down, single-compute-die Blackwell config called the B200A to market as a stopgap measure.

And while Huang's comments seem to confirm Nvidia is prioritizing its GB200 SKUs, those are also the chips that offer the highest efficiency when running large models. As it stands, Nvidia's HGX/DGX H100 platforms, with their eight GPUs, struggle to support models with more than a few hundred billion parameters. Llama 3 405B, for example, can only run on an HGX H100 system when quantized to 8-bit precision. Nvidia's top-specced Blackwell systems should be able to support models more than 10x that size.
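
The weight-only arithmetic below shows why. It ignores KV cache, activations, and framework overhead, takes the widely published 80 GB-per-H100 figure as given, and assumes roughly 192 GB per top-specced Blackwell GPU (our assumption, not an Nvidia disclosure).

# Rough weight-only memory check for serving a big model on a single box.
# Ignores KV cache, activations, and framework overhead, so real headroom is smaller.
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param     # 1B params at 1 byte is ~1 GB

llama3_405b = 405
hgx_h100_gb = 8 * 80        # eight H100s at 80 GB apiece
hgx_blackwell_gb = 8 * 192  # eight Blackwell GPUs, assuming ~192 GB apiece

for name, bytes_per_param in (("FP16/BF16", 2.0), ("FP8/INT8", 1.0)):
    need = weights_gb(llama3_405b, bytes_per_param)
    print(f"{name}: {need:.0f} GB of weights | "
          f"fits HGX H100 ({hgx_h100_gb} GB): {need < hgx_h100_gb} | "
          f"fits 8x Blackwell ({hgx_blackwell_gb} GB): {need < hgx_blackwell_gb}")

At 16-bit precision the 405B model's weights alone overflow the 640 GB of an eight-way H100 box, which is why 8-bit quantization is mandatory there.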

As our sibling site The Next Platform points out, not disclosing larger issues at this point would likely draw the ire of investigators and the SEC. Considering Nvidia is already under the Department of Justice's magnifying glass, it seems transparency would be the best play here. So while there's clearly some truth to the reports of Blackwell manufacturing defects, they may not be as severe as rumored.

In any case, it's clear that Nvidia's leadership is keen to telegraph that all is fine and well. Amid the Hot Chips conference this week, the AI infrastructure goliath revealed that Blackwell was performing well enough to merit an MLPerf submission, which it claims demonstrated a 4x lead over the H100 in a one-to-one drag race on Llama 2 70B. ®

Source: theregister.com
