News Publishers Block Internet Archive to Halt AI Training on Archived Content

More than 240 news organizations across nine countries are blocking the Internet Archive's web crawlers, out of concern that AI companies are using archived content to train large language models without permission or payment.

The Internet Archive, home to over one trillion web pages dating back to 1996, is a vast public resource used by historians and journalists. Major outlets including CNN, The New York Times, and The Guardian are among those blocking or restricting its access.

Originality AI reports that more than 20 major news outlets already block the Archive's main crawler, ia_archiverbot. In total, 241 news sites worldwide block at least one of the Archive's four crawlers. The largest group of blocking sites is owned by USA Today Co, whose blocks keep hundreds of local publications out of the historical record.
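
Blocks of this kind are typically declared in a site's robots.txt file, though the reporting here does not spell out how each publisher implements them. As a rough illustration only, the Python sketch below checks whether a given robots.txt disallows a named crawler. The sample file and the helper name blocks_agent are hypothetical, and the user-agent string is simply the crawler name quoted above; the tokens the Archive's crawlers actually send may differ.

```python
from urllib.robotparser import RobotFileParser

# Crawler name as quoted in the article; treat it as an assumption,
# since the real user-agent token may be spelled differently.
ARCHIVE_AGENT = "ia_archiverbot"

# Hypothetical robots.txt for illustration: one crawler is shut out,
# every other crawler is allowed.
SAMPLE_ROBOTS = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def blocks_agent(robots_txt: str, agent: str,
                 url: str = "https://example.com/") -> bool:
    """Return True if the robots.txt text disallows `agent` from fetching `url`."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(blocks_agent(SAMPLE_ROBOTS, ARCHIVE_AGENT))   # named crawler
    print(blocks_agent(SAMPLE_ROBOTS, "SomeOtherBot"))  # any other crawler
```

A survey like the one described would repeat a check of this sort across many publishers' robots.txt files and across each crawler name of interest. Note that robots.txt is a voluntary convention: it records a publisher's stated policy, not a technical barrier.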

Archival news content offers high-quality, structured data well suited to AI training. It already appears in key AI-training datasets, which has fueled lawsuits accusing AI firms such as Perplexity and OpenAI of copyright violations.

Graham James, a New York Times spokesperson, said: "Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us." He added that the newspaper's original journalism should not be used without permission.

The Guardian has taken a more limited approach, restricting rather than fully blocking access.

Mark Graham, director of the Wayback Machine, insists the Archive is "collateral damage" in the AI copyright war. The Archive has limited large downloads and automated extraction, but Graham warns that the blocks threaten preservation and would allow articles to be edited without accountability.

Some news organizations are seeking compromises. Separately, the non-profit Fight for the Future has launched a petition opposing the blocks. Signed by 100 journalists, it cites the need to protect public records.