On Tuesday, the internet infrastructure company Cloudflare announced that it will block AI bots from scraping data from its sites without opt-in permission. Cloudflare hosts about 20 percent of the Web, and the move is seen as a win for the publishing industry. Previously, website owners using Cloudflare could choose to block AI bots, also known as crawlers, when setting up a domain, but if they did nothing, crawlers were permitted to scrape the sites. Such free access to online content has enabled AI companies to create massive training data sets for large language models. The new policy, however, will shift the default to require site owners to actively allow crawlers. Cloudflare is also experimenting with a plan called "pay per crawl" that lets publishers set fees for AI companies to access their sites.
For the past couple of decades, Google and other search engines worked by returning a ranked list of website results in response to a user's query. Last year, however, Google launched AI Overview, a feature that displays AI-generated summaries of search results at the top of the page. This means that users can get answers to their questions without leaving Google, resulting in a significant drop in traffic referrals to publishers. This trend isn't likely to reverse anytime soon. According to Cloudflare CEO Matthew Prince, the interface of the future Web will look more like ChatGPT than a spartan search box and ten blue links. For the news industry, already struggling with a deteriorating business model, that vision is not an encouraging one. So it's unsurprising that a flock of major news publishers, including the Associated Press, Time, The Atlantic, and even Reddit, have signed on with Cloudflare. "For too long, giant AI companies have built businesses on training data that they never paid for, and by scraping sites from whom they haven't even asked permission," Nicholas Thompson, CEO of The Atlantic, said in a statement.
Publishers have not been entirely defenseless against AI bots up to now. For decades, website hosts have relied on the Robots Exclusion Protocol (commonly implemented via a robots.txt file) to instruct crawlers on which parts of a website they're allowed to access. While it's not illegal to ignore these directives, doing so is generally considered bad internet etiquette. However, several investigations suggest that some AI companies still chew digital content with their mouths wide open. In some cases, AI crawlers have hammered websites so hard that their bandwidth was stretched to the limit. An analysis carried out by developer Robb Knight found that Perplexity ignores robots.txt files, despite Perplexity claiming otherwise. Last year, Wired caught Perplexity trespassing on the magazine's website and those of other Condé Nast publications.
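To see how the Robots Exclusion Protocol works in practice, here is a minimal sketch using Python's standard-library parser. The robots.txt rules below are illustrative, not any particular publisher's; GPTBot (OpenAI) and CCBot (Common Crawl) are real crawler user-agents, but each site writes its own directives, and, as noted above, nothing technically forces a crawler to obey them.

```python
# A sketch of the Robots Exclusion Protocol, using Python's built-in parser.
from urllib.robotparser import RobotFileParser

# Example robots.txt: block two well-known AI crawlers, allow everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks before fetching; compliance is voluntary.
print(parser.can_fetch("GPTBot", "/articles/some-story"))     # False
print(parser.can_fetch("Googlebot", "/articles/some-story"))  # True
```

The key point for the dispute described above: `can_fetch` is advice the crawler asks for itself. A crawler that simply never consults robots.txt, as Knight's analysis suggests Perplexity does not, faces no technical barrier at all, which is what makes Cloudflare's network-level blocking a meaningful change.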
According to Bill Gross, an entrepreneur who helped make Google's search advertising profitable, the AI bots are shoplifting, and as such, he argues, AI services should pay up. Gross founded ProRata, an AI startup that participates in Cloudflare's pay-per-crawl program. Pay-per-crawl allows domain owners to set a flat, per-request price across their entire site. Still in its early stages, the experiment raises questions about what pricing tiers publishers will be able to command in a pay-per-crawl market. For instance, will major publications like the New York Times be able to charge double or quadruple per crawl compared with a local newspaper?
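Mechanically, Cloudflare has said pay-per-crawl is built on the long-dormant HTTP 402 "Payment Required" status code: a crawler without payment arrangements gets a 402 response instead of the page. The sketch below shows the kind of decision a paying crawler might make; the `crawler-price` header name and the numbers are assumptions for illustration, not Cloudflare's documented API.

```python
# Hedged sketch of a pay-per-crawl client decision, assuming the server
# signals its price via an HTTP 402 response. The "crawler-price" header
# name is hypothetical, used here only to illustrate the flow.
def should_fetch(status: int, headers: dict, budget_per_page: float) -> bool:
    if status == 200:
        return True          # page served for free (or payment already settled)
    if status == 402:
        # Price quoted by the publisher; treat a missing quote as unaffordable.
        price = float(headers.get("crawler-price", "inf"))
        return price <= budget_per_page
    return False             # blocked, errored, or otherwise unavailable

print(should_fetch(200, {}, 0.01))                          # True
print(should_fetch(402, {"crawler-price": "0.005"}, 0.01))  # True
print(should_fetch(402, {"crawler-price": "0.05"}, 0.01))   # False
```

Under a scheme like this, the tiered-pricing question above becomes concrete: a crawler with a fixed per-page budget would simply skip publishers whose quoted price exceeds it.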
Not all crawlers are AI bots; some enhance security, archive webpages, or index them for search engines. As such, there are crawlers that publishers may well want to allow onto their sites. Cloudflare claims to be able to accommodate these benevolent bots by allowing domain owners to selectively bypass payment for certain crawlers. For others, Cloudflare can selectively impose a punishment: one of its other features traps misbehaving bots in an "AI Labyrinth" that resembles a webpage but spins them around in circles.
Shayne Longpre, a PhD candidate at MIT, thinks that this sort of pushback against crawlers also threatens the transparency and open borders of the Web. Writing for the MIT Technology Review earlier this year, Longpre argued that raising a drawbridge to crawlers could shrink the internet's biodiversity. "Ultimately, the Web is being subdivided into territories where fewer crawlers are welcome," Longpre wrote. "For real users, this is making it harder to access news articles, see content from their favorite creators, and navigate the Web without hitting logins, subscription demands, and captchas each step of the way."