
Nvidia admits Blackwell defect, but says it's fine now

Nvidia has confirmed earlier reports that its Blackwell generation of GPUs suffered from a design defect that adversely impacted the yields of the hotly anticipated accelerators.

"We executed a change to the Blackwell GPU mask to improve production yields," Nvidia CFO Colette Kress acknowledged on the companuy’s Wednesday's earnings call.

Neither Kress nor CEO Jensen Huang explained the nature of the defect beyond the mention of a mask. However, whatever was wrong with the manufacturing process apparently won't stop Nvidia from shipping Blackwell in Q4 as previously promised.

"The change to the mask is complete. There were no functional changes necessary. And so, we're sampling functional samples of Blackwell — Grace Blackwell in a variety of system configurations as we speak," Huang told investors.

Even if Nvidia manages to ship Blackwell on time, the change could simply mean a later ramp and revenue realization than previously planned. Even so, Kress insisted that "Blackwell production ramp is scheduled to begin in the fourth quarter and continue into fiscal year '26," and that in Q4, the GPU giant will "ship several billion dollars in Blackwell revenue."

Huang doubled down in Q&A: "When I said ship production in Q4, I meant shipping out. I don't mean starting to ship … I don't mean starting production."

While Huang and Kress attempted to play off the mask swap as a minor matter, Gartner analyst Gaurav Gupta told The Register that design changes like these can be extremely costly, especially this late in production.

"Typically, you want these issues to be resolved early on," he said.

Gupta noted that even if Blackwell is late, Nvidia has two advantages: the first is that, despite recent gains by AMD, Intel, and others, it still doesn't face much competition. The second is that the mask change could end up saving Nvidia a considerable amount of cash in the long run.

"If the new mask improves production yield, they will easily recover any loss due to delays. In chip fabrication, achieving high yields is critical, so less fabricated chips are discarded and production is more efficient and reliable, and it also helps improve cycle time."

Nvidia's disclosure mirrors statements made by Foxconn execs earlier this month that Grace-Blackwell-based products would begin shipping in small volumes in the fourth quarter.

"We are on track to develop and prepare the manufacturing of the new AI server to start shipping in small volumes in the last quarter of 2024, and increase the production volume in the first quarter of next year," Foxconn spokesperson James Wu previously said.

Announced at Nvidia’s GTC conference in March 2024, the Blackwell generation of GPUs boasts more than twice the VRAM of the H100 and 2.5-5x higher performance. Achieving this performance uplift requires some major changes in design philosophy, including the move to a multi-die configuration – an approach similar to that used by AMD and Intel in their latest generation of accelerators.

Making matters more complicated is that Nvidia is pushing its Grace-Blackwell Superchips, AKA the GB200, much harder this generation. The 2,700W parts pair two Blackwell GPUs with a single 72-core Grace CPU. Thirty-six of these superchips form its 120kW NVL72 rack systems, which Nvidia claims offer a 30x improvement in inference performance thanks to the speedy NVLink switch fabric tying everything together. We took a deep dive on that system back at GTC if you're curious.
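
Those rack-level figures roughly hang together, as the back-of-the-envelope check below shows. The superchip count, per-part wattage, and 120 kW envelope come from the numbers above; the split of the remainder between NVLink switch trays, NICs, and cooling is our assumption.

# Back-of-the-envelope check of the NVL72 numbers quoted above.
# Superchip count, per-part wattage, and rack budget are as reported;
# what the leftover "overhead" covers is an assumption.
superchips_per_rack = 36
watts_per_superchip = 2_700        # GB200: two Blackwell GPUs plus one Grace CPU
rack_budget_watts = 120_000        # Nvidia's quoted NVL72 envelope

compute_watts = superchips_per_rack * watts_per_superchip
overhead_watts = rack_budget_watts - compute_watts   # switches, NICs, fans, etc.

print(f"GPUs per rack:    {superchips_per_rack * 2}")
print(f"Superchip power:  {compute_watts / 1000:.1f} kW")
print(f"Implied overhead: {overhead_watts / 1000:.1f} kW of the 120 kW budget")

That leaves a bit under 23 kW for the NVLink switch fabric and everything else, assuming the quoted figures are nameplate numbers.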

However, it appears something about the design wasn't quite right, necessitating the mask change. As we previously reported, Nvidia had supposedly warned Microsoft that shipments of the chips had been delayed due to a problem with the packaging tech used to stitch Blackwell's two dies together.

Nvidia was therefore said to be prioritizing its flagship GB200 parts over the lower-spec HGX B100 and B200 configurations, and would bring a trimmed-down, single-compute-die Blackwell config called the B200A to market as a stopgap measure.

And while Huang's comments seem to confirm Nvidia is prioritizing its GB200 SKUs, those are also the chips that offer the highest efficiency when running large models. As it stands, Nvidia's HGX/DGX H100 platforms, with their eight GPUs, struggle to support models with more than a few hundred billion parameters. Llama 3 405B, for example, can only run on an HGX H100 system when quantized to 8-bit precision. Nvidia's top-specced Blackwell systems should be able to support models more than 10x that size.
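
The weight-only arithmetic below shows why. It ignores KV cache, activations, and framework overhead, takes the widely published 80 GB-per-H100 figure as given, and assumes roughly 192 GB per top-specced Blackwell GPU (our assumption, not an Nvidia disclosure).

# Rough weight-only memory check for serving a big model on a single box.
# Ignores KV cache, activations, and framework overhead, so real headroom is smaller.
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param     # 1B params at 1 byte is ~1 GB

llama3_405b = 405
hgx_h100_gb = 8 * 80        # eight H100s at 80 GB apiece
hgx_blackwell_gb = 8 * 192  # eight Blackwell GPUs, assuming ~192 GB apiece

for name, bytes_per_param in (("FP16/BF16", 2.0), ("FP8/INT8", 1.0)):
    need = weights_gb(llama3_405b, bytes_per_param)
    print(f"{name}: {need:.0f} GB of weights | "
          f"fits HGX H100 ({hgx_h100_gb} GB): {need < hgx_h100_gb} | "
          f"fits 8x Blackwell ({hgx_blackwell_gb} GB): {need < hgx_blackwell_gb}")

At 16-bit precision the 405B model's weights alone overflow the 640 GB of an eight-way H100 box, which is why 8-bit quantization is mandatory there.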

As our sibling site The Next Platform points out, not disclosing larger issues at this point would likely draw the ire of investigators and the SEC. Considering Nvidia is already under the Department of Justice's magnifying glass, it seems transparency would be the best play here. So while there's clearly some truth to the reports of Blackwell manufacturing defects, they may not be as severe as rumored.

In any case, it's clear that Nvidia's leadership is keen to telegraph that all is fine and well. Amid the Hot Chips conference this week, the AI infrastructure goliath revealed that Blackwell was performing well enough to merit an MLPerf submission, which it claims demonstrated a 4x lead over the H100 in a one-to-one drag race on Llama 2 70B. ®

Source: theregister.com
