OpenAI will show secret training data to copyright lawyers

OpenAI has agreed to reveal the data used to train its generative AI models to attorneys pursuing copyright claims against the developer on behalf of several authors.

The authors – among them Paul Tremblay, Sarah Silverman, Michael Chabon, David Henry Hwang, and Ta-Nehisi Coates – sued OpenAI and its affiliates last year, arguing its AI models have been trained on their books and reproduce their words in violation of US copyright law and California's unfair competition rules. The writers' actions have been consolidated into a single claim [PDF].

OpenAI faces similar allegations from other plaintiffs, and earlier this year, Anthropic was also sued by aggrieved authors.

On Tuesday, US magistrate judge Robert Illman issued an order [PDF] specifying the protocols and conditions under which the authors' attorneys will be granted access to OpenAI's training data.

The terms of access are strict: the order treats the training data set as the equivalent of sensitive source code, a proprietary business process, or a secret formula. Even so, the models behind ChatGPT (GPT-3.5, GPT-4, etc.) presumably relied heavily on publicly accessible data. That was certainly true of GPT-2, for which a list of domains whose content was scraped is available on GitHub (The Register is on the list).

"Training data shall be made available by OpenAI in a secure room on a secured computer without internet access or network access to other unauthorized computers or devices," the judge's order states.

No recording devices will be permitted in the secure room and OpenAI's legal team will have the right to inspect any notes made therein.

OpenAI did not immediately respond to a request to explain why such secrecy is required. One likely reason is fear of legal liability – if the extent of permissionless use of online data were widely known, that could prompt even more lawsuits.

Forthcoming AI regulations may force developers to disclose more about what goes into their models. Europe's Artificial Intelligence Act, which takes effect in August 2025, declares, "In order to increase transparency on the data that is used in the pre-training and training of general-purpose AI models, including text and data protected by copyright law, it is adequate that providers of such models draw up and make publicly available a sufficiently detailed summary of the content used for training the general-purpose AI model."

The rules include some protections for trade secrets and confidential business information, but make clear that the information provided should be detailed enough to satisfy those with legitimate interests – "including copyright holders" – and to help them enforce their rights.

California legislators have approved an AI data transparency bill (AB 2013), which awaits governor Gavin Newsom's signature. And a federal bill, the Generative AI Copyright Disclosure Act, would require makers of AI models to notify the US Copyright Office of all copyrighted content used for training.

The push for training data transparency may concern OpenAI, which already faces many copyright claims. The Microsoft-affiliated developer continues to insist that its use of copyrighted content qualifies as fair use and is therefore legally defensible. Its attorneys said as much in their answer [PDF] last month to the authors' amended complaint.

"Plaintiffs allege that their books were among the human knowledge shown to OpenAI's models to teach them intelligence and language," OpenAI's attorneys argue. "If so, that would be paradigmatic transformative fair use."

That said, OpenAI's legal team contends that generative AI is about creating new content rather than reproducing training data. The processing of copyrighted works during model training allegedly doesn't infringe because it merely extracts word frequencies, syntactic patterns, and other statistical data.

"The purpose of those models is not to output material that already exists; there are much less computationally intensive ways to do that," OpenAI's attorneys claim. "Instead, their purpose is to create new material that never existed before, based on an understanding of language, reasoning, and the world."

That's a bit of misdirection. Generative AI models, though capable of unexpected output, are designed to predict a sequence of tokens or characters based on patterns in their training data, conditioned on a given prompt and any applicable system rules. Predictions insufficiently grounded in training data are called hallucinations – "creative" though they may be, they are not a desired result.
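That prediction mechanism can be illustrated with a toy bigram model – a deliberately simplified sketch, not OpenAI's architecture (production LLMs use neural networks over billions of parameters, not lookup tables), but the underlying principle of predicting continuations from statistics of the training data is the same:

```python
from collections import defaultdict, Counter

def train_bigram(corpus: str) -> dict:
    """Count which token follows which in the training text."""
    counts = defaultdict(Counter)
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(model: dict, token: str):
    """Greedy prediction: the most frequent successor seen in training.

    Returns None when the token never appeared in training data --
    a real model, forced to answer anyway, might 'hallucinate' here.
    """
    followers = model.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" -- it follows "the" most often
print(predict_next(model, "sat"))  # "on"
```

Even at this scale, the point the lawyers are arguing over is visible: the model's output is entirely a statistical function of what it was trained on.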

No open and shut case

Whether AI models reproduce training data verbatim is relevant to copyright law. Their capacity to craft content that's similar but not identical to source data – "money laundering for copyrighted data," as developer Simon Willison has described it – is a bit more complicated, legally and morally.

Even so, there's considerable skepticism among legal scholars that copyright law is the appropriate regime to address what AI models do and their impact on society. To date, US courts have echoed that skepticism.

As noted by Politico, US District Court judge Vincent Chhabria last November granted Meta's motion to dismiss [PDF] all but one of the claims brought on behalf of author Richard Kadrey against the social media giant over its LLaMa model. Chhabria called the claim that LLaMa itself is an infringing derivative work "nonsensical." He dismissed the copyright claims, the DMCA claim and all of the state law claims.

That doesn't bode well for the authors' lawsuit against OpenAI, or other cases that have made similar allegations. No wonder there are over 600 proposed laws across the US that aim to address the issue. ®

Source: theregister.com
