Reddit plans new changes to protect against AI bots
Reddit is updating its robots.txt file to stop AI bots from scraping its content. The changes include rate limits, blocks, and payment requirements for data used in AI training, amid concerns about content misuse.
Automated web bots consult the Robots Exclusion Protocol file (robots.txt) to determine whether they may crawl a site. Reddit said on Tuesday that it is updating this file.
In the past, websites used the robots.txt file to instruct search engines on how to crawl a site and then direct users to the content.
However, with the rise of AI, companies are scraping websites to train models without acknowledging the source.
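The Robots Exclusion Protocol described above can be sketched with Python's standard-library parser. The rules and bot names below are hypothetical illustrations, not Reddit's actual robots.txt:

```python
# Illustrative sketch of the Robots Exclusion Protocol, using Python's
# standard-library parser. The rules and the "HypotheticalAIBot" name
# below are made up for illustration; they are not Reddit's real file.
from urllib.robotparser import RobotFileParser

sample_robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: HypotheticalAIBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# A generic crawler may fetch public pages but not /private/.
print(parser.can_fetch("GenericCrawler", "/index.html"))    # True
print(parser.can_fetch("GenericCrawler", "/private/data"))  # False

# The blanket "Disallow: /" bars the named AI bot from everything.
print(parser.can_fetch("HypotheticalAIBot", "/index.html"))  # False
```

As the article notes, compliance is voluntary: robots.txt only expresses the site's wishes, and nothing technically stops a crawler from ignoring it.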
Along with the new robots.txt file, Reddit will continue to limit the number of bots and crawlers that can access its platform and block those that are not known.
Reddit stated that bots and crawlers that violate its public content policy and lack an agreement with the platform will be blocked or rate-limited.
According to Reddit, the update should have no effect on most users or “good faith actors,” such as researchers and groups like the Internet Archive.
The update aims to prevent AI companies from using Reddit content to train their large language models. The robots.txt file could, of course, simply be ignored by AI crawlers.
The AI search startup Perplexity has been accused of scraping content it should not. Despite publishers blocking the startup in their robots.txt files, Perplexity appears indifferent to requests not to scrape their websites.
When asked about the claims, Aravind Srinivas, CEO of Perplexity, said that the robots.txt file is not a legal framework.
Impact of Reddit’s policy changes on AI data usage
The upcoming changes will not affect companies that have deals with Reddit. Reddit, for example, has a $60 million deal with Google that lets Google train its AI models on content from the social platform.
Under the new rules, other companies wishing to use Reddit's data for AI training will have to pay.
A blog post stated that anyone accessing Reddit content must abide by its policies, which also aim to safeguard Redditors: “They have to be someone we trust to have access to a lot of Reddit content.”