The forum and social media platform Reddit is once again fighting back against the automated harvesting of its data for AI training. Reddit is now even blocking the Internet Archive's Wayback Machine, as The Verge reports. The aim is to prevent data-hungry AI companies from obtaining Reddit content through the back door of the Wayback Machine.
The current move seems quite aggressive, but it is also a logical one: Reddit has been selling access to its data to AI companies for quite some time now. Google, for example, pays the platform $60 million annually to be allowed to use the data for Gemini and similar products. OpenAI reportedly has a similar deal. Others, like Anthropic, are locked out until they are willing to pay.
Because Reddit's many posts, some of which contain hundreds of contextually relevant replies, are quite useful for AI training, Reddit has been blocking the scrapers of non-paying AI companies for some time now. Apparently, resourceful AI companies were able to circumvent this block for several months by simply scraping the Reddit threads collected by the Internet Archive instead.
The Wayback Machine is not locked out completely, by the way: at least the front page with the most popular posts continues to be archived. Reddit told The Verge that the ban is in place "until they are able to defend their website and comply with the platform's policies." If AI scrapers could be reliably excluded from the Wayback Machine, archiving would likely be able to resume.