Reddit stands firm against AI companies scraping content for training without paying

A hot potato: Reddit has been making moves as part of a crackdown on companies indiscriminately scraping the website for AI training purposes. Its philosophy is that AI companies stand to make millions or billions on large language models they are developing with resources they do not own. It's analogous to someone taking two-by-fours from a lumberyard to build their house just because the yard doesn't have a locked gate. But the issue goes way beyond Reddit and is central to how the open web has worked so far.

The Robots Exclusion Protocol is a web standard used to manage web crawler and bot access to websites. Implemented through a site's robots.txt file, it tells search engines and other crawlers which parts of a site they may crawl, helping webmasters keep bots away from sensitive content and manage traffic efficiently. However, it works on the honor system, with few ways to enforce it.
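To make the honor system concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module. The rules, the "ExampleAIBot" user agent, and the example.com URL are illustrative assumptions, not Reddit's actual file or any real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one named crawler is allowed, everyone else is blocked.
# This is an illustration, not Reddit's actual robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# Hypothetical page a crawler might want to fetch.
URL = "https://example.com/r/technology/"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "Googlebot", URL))     # True
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", URL))  # False
```

Nothing in this check is enforced by the server. A crawler that never runs it, or ignores the result, can still request the page, which is why compliance depends on the crawler's operator.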

Last week, Ars Technica reported that Reddit posts were no longer appearing in any search engine except Google. It is no secret that Reddit has already signed a $60 million licensing deal with Alphabet that lets Google use its content for training. Meanwhile, Reddit has increasingly ranked at the top of Google searches over the past year (quid pro quo, or maybe not...).

The company also recently notified users that it changed its robots.txt file to exclude bots and crawlers that didn't have permission to access its data. Reddit CEO Steve Huffman said he believes in an open internet but that companies now use search engine web crawlers to scrape information for profit, a far cry from their historical use. "I think the traditional value exchange from search engines has changed," Huffman told The Verge.

"Search and summarization and training are merging, and the value exchange of crawling in exchange for traffic back is becoming muddied."

To this point, Huffman said that blocking companies unwilling to pay for data harvesting has been "a real pain in the ass," prompting the changes to Reddit's robots.txt. For the most part, companies have respected Reddit's wishes, and several, including Microsoft, Anthropic, and Perplexity, have entered negotiations to license its content.

Huffman said that the biggest thorn in his side is that some companies scraping Reddit data are turning around and selling it to other AI firms via their APIs. He specifically called out Microsoft AI CEO Mustafa Suleyman for recently comparing all public data on the internet to "freeware."

"We've had Microsoft, Anthropic, and Perplexity act as though all of the content on the internet is free for them to use," said Huffman. "That's their real position." While Microsoft Bing has been gracious in respecting Reddit's decision to block its crawlers, the company managed to slip in a denigrating remark.

Microsoft AI CEO Mustafa Suleyman: the social contract for content that is on the open web is that it's "freeware" for training AI models pic.twitter.com/FN1xrqnJC0

– Tsarathustra (@tsarnick) June 26, 2024

"Reddit has blocked Bing from crawling their site for search, favoring another search engine and impacting competition from Bing and Bing-powered engines," Microsoft spokesperson Caitlin Roulston said last week. "We honor the directions provided by websites that do not want content on their pages to be used with our generative AI models."

So far, Google and OpenAI are the only companies on Reddit's whitelist. If other search engines return anything but outdated Reddit content, they are not abiding by the website's robots.txt file.

Reddit profiting from user-generated content through these licensing deals is still a hot potato. On the one hand, the lucrative fees do not go into the pockets of the community who make up Reddit's forums. On the other hand, these licensing deals are not much different from those of other companies.

OpenAI already pays licensing fees to large publishers like Dotdash Meredith, Axel Springer, the Associated Press, and The Atlantic. It is unconfirmed but doubtful that these publications pass those profits to their writers via raises or bonuses. Does that make it right? No, and the courts are still deciding how to treat this unprecedented activity. However, it's par for the course at this point.

This issue is not limited to Reddit; it affects all online publishers, big and small. In the race against AI training abuse, Reddit is one of the few with the muscle and influence to call out AI companies. While big media companies try to monetize and reach agreements, the rest of the internet is struggling. In fact, some subreddits have their own bots that copy the full text of articles from their original sources and post it as the first comment in the thread, so that content is effectively copied and then sold to AI companies.

Until there are governing regulations, the AI gold rush will be like the California gold rush of 1848. Artificial intelligence firms will continue flocking to shovel AI products down everyone's throats for profit or to gather more data. Meanwhile, companies like Reddit and Vox will keep handing them the shovels.

Image credit: Jernej Furman

Source: techspot.com