Nvidia says scraping 80 years' worth of videos daily to train its AI models is in "the spirit of copyright law"

A hot potato: Once again, it's been revealed that a company has been scraping data from the internet to train its AI models using a questionable interpretation of copyright law. On this occasion, Nvidia has been downloading videos from YouTube, Netflix, and other platforms to gather data for its commercial AI products.

According to internal Slack chats, emails, spreadsheets, and several other sources obtained by 404 Media, Nvidia asked workers to download videos from various online platforms to compile data to train its Omniverse, autonomous vehicles, and digital human products.

Codenamed Cosmos, the project involved using between 20 and 30 virtual machines on Amazon Web Services to download the equivalent of 80 years of videos every day. Nvidia was downloading so much that it managed to accumulate over 30 million URLs in the space of one month.
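For a rough sense of that scale, the reported figures can be turned into per-day and per-machine numbers. The snippet below is only a back-of-the-envelope sketch: it assumes a 365-day year, a 30-day month, and an even split of work across the virtual machines, none of which is stated in the report, and the function names are illustrative.

```python
# Back-of-the-envelope arithmetic on the reported figures.
# Assumptions (not from the report): 365-day year, 30-day month,
# and downloads spread evenly across the virtual machines.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def video_hours_per_day(years_of_video_per_day: int) -> int:
    """Convert 'years of video downloaded per day' into hours of video per day."""
    return years_of_video_per_day * HOURS_PER_YEAR


def hours_per_vm(total_hours: int, vm_count: int) -> float:
    """Hours of video each VM would handle per day under an even split."""
    return total_hours / vm_count


def urls_per_day(urls_per_month: int, days_in_month: int = 30) -> float:
    """Average number of URLs collected per day over one month."""
    return urls_per_month / days_in_month


total_hours = video_hours_per_day(80)
print(f"Video hours downloaded per day: {total_hours:,}")
for vms in (20, 30):
    print(f"With {vms} VMs: {hours_per_vm(total_hours, vms):,.0f} hours per VM per day")
print(f"URLs per day (lower bound): {urls_per_day(30_000_000):,.0f}")
```

Under those assumptions, 80 years of video works out to roughly 700,000 hours of footage per day, and because the report says "over" 30 million URLs, the per-day URL figure is a floor rather than an exact count.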

In addition to Netflix and YouTube, Nvidia workers were told to train the AI models on the movie trailer database MovieNet, internal libraries of video game footage, and the GitHub video dataset WebVid, which has since been taken down. The company also used InternVid-10M, a dataset containing 10 million YouTube video IDs.

Copyright issues are always at the forefront of discussions when it comes to companies scraping data from the web. Nvidia employees reportedly discussed the matter and used several methods to try to avoid potential legal blowback, even as they drew on data marked for academic or non-commercial use only.

HD-VG-130M was one of the datasets Nvidia used. This library of 130 million YouTube videos states in its license that it's for academic use only, something Nvidia appears to have ignored. Employees also used Google's cloud service to download the YouTube-8M dataset, as directly downloading the videos isn't allowed under the terms of service.

"We cleared the download with Google/YouTube ahead of time and dangled as a carrot that we were going to do so using Google Cloud," wrote one person in a Slack channel. "After all, usually, for 8 million videos, they would get lots of ad impressions, revenue they lose out on when downloading for training, so they should get some money out of it."

Nvidia also reportedly used VMs with rotating IP addresses in some cases to avoid YouTube detecting what it was doing and banning the users.

In April, it was reported that OpenAI researchers created a speech recognition tool called Whisper in 2021 in order to access more reputable English-language text. It was designed to transcribe audio from YouTube videos, giving the company a trove of data to train its LLMs. Why didn't Google object? Possibly because it also transcribed YouTube videos for its AI models, potentially violating creators' copyrights.

YouTube previously said that scraping data to train AI models was a "clear violation" of its terms. Nvidia told 404 Media that its actions were "in full compliance with the letter and the spirit of copyright law."

If you were wondering whether Nvidia used gameplay footage from its own GeForce Now service to train its AI: no, it didn't, though it sounds like such a thing could happen at some point. "We don't yet have statistics or video files yet, because the infras is not yet set up to capture lots of live game videos & actions," a senior Nvidia research scientist told other employees. "There're both engineering & regulatory hurdles to hop through."

Many AI firms that scrape data defend the practice by claiming it is fair use under copyright law. Music-generating AI startups Udio and Suno are using this defense in the copyright lawsuits filed against them by major record companies.

Source: techspot.com
