On Tuesday, the internet infrastructure company Cloudflare announced that it will block AI bots from scraping data from its sites without opt-in permission. Cloudflare hosts about 20 percent of the Web, and the move is seen as a win for the publishing industry. Previously, website owners using Cloudflare could choose to block AI bots, also known as crawlers, when setting up a domain, but if they did nothing, crawlers were permitted to scrape the sites. Such free access to online content has enabled AI companies to create massive training data sets for large language models. The new policy, however, will shift the default to require site owners to actively allow crawlers. Cloudflare is also experimenting with a plan called "pay per crawl" that lets publishers set fees for AI companies to access their sites.
For the past couple of decades, Google and other search engines worked by returning a ranked list of website results in response to a user's query. Last year, however, Google launched AI Overview, a feature that displays AI-generated summaries of search results at the top of the page. This means that users can get answers to their questions without leaving Google, resulting in a significant drop in traffic referrals to publishers. This trend isn't likely to reverse anytime soon. According to Cloudflare CEO Matthew Prince, the interface of the future Web will look more like ChatGPT than a spartan search box and ten blue links. For the news industry, already struggling with a deteriorating business model, that vision is not an encouraging one. So it's unsurprising that a flock of major news publishers, including the Associated Press, Time, The Atlantic, and even Reddit, have signed on with Cloudflare. "For too long, giant AI companies have built businesses on training data that they never paid for, and by scraping sites from whom they haven't even asked permission," Nicholas Thompson, CEO of The Atlantic, said in a statement.
Publishers have not been entirely defenseless against AI bots up to now. For decades, website hosts have relied on the Robots Exclusion Protocol (commonly implemented via a robots.txt file) to instruct crawlers on which parts of a website they're allowed to access. While it's not illegal to ignore these directives, doing so is generally considered bad internet etiquette. However, several investigations suggest that some AI companies still chew digital content with their mouths wide open. In some cases, AI crawlers have hammered websites so hard that their bandwidth was stretched to the limit. An analysis carried out by developer Robb Knight found that Perplexity ignores robots.txt files, despite Perplexity claiming otherwise. Last year, Wired caught Perplexity trespassing on the magazine's website and those of other Condé Nast publications.
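To see how the Robots Exclusion Protocol works in practice, here is a minimal sketch using Python's standard-library parser. The robots.txt rules below are illustrative, not any particular publisher's; GPTBot (OpenAI) and CCBot (Common Crawl) are real crawler user-agents, but each site writes its own directives, and, as noted above, nothing technically forces a crawler to obey them.

```python
# A sketch of the Robots Exclusion Protocol, using Python's built-in parser.
from urllib.robotparser import RobotFileParser

# Example robots.txt: block two well-known AI crawlers, allow everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks before fetching; compliance is voluntary.
print(parser.can_fetch("GPTBot", "/articles/some-story"))     # False
print(parser.can_fetch("Googlebot", "/articles/some-story"))  # True
```

The key point for the dispute described above: `can_fetch` is advice the crawler asks for itself. A crawler that simply never consults robots.txt, as Knight's analysis suggests Perplexity does not, faces no technical barrier at all, which is what makes Cloudflare's network-level blocking a meaningful change.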
According to Bill Gross, an entrepreneur who helped make Google's search advertising profitable, the AI bots are shoplifting, and as such, he argues, AI services should pay up. Gross founded ProRata, an AI startup that participates in Cloudflare's pay-per-crawl program. Pay-per-crawl allows domain owners to set a flat, per-request price across their entire site. Still in its early stages, the experiment raises questions about what pricing tiers publishers will be able to command in a pay-per-crawl market. For instance, will major publications like the New York Times be able to charge double or quadruple per crawl compared with a local newspaper?
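Mechanically, Cloudflare has said pay-per-crawl is built on the long-dormant HTTP 402 "Payment Required" status code: a crawler without payment arrangements gets a 402 response instead of the page. The sketch below shows the kind of decision a paying crawler might make; the `crawler-price` header name and the numbers are assumptions for illustration, not Cloudflare's documented API.

```python
# Hedged sketch of a pay-per-crawl client decision, assuming the server
# signals its price via an HTTP 402 response. The "crawler-price" header
# name is hypothetical, used here only to illustrate the flow.
def should_fetch(status: int, headers: dict, budget_per_page: float) -> bool:
    if status == 200:
        return True          # page served for free (or payment already settled)
    if status == 402:
        # Price quoted by the publisher; treat a missing quote as unaffordable.
        price = float(headers.get("crawler-price", "inf"))
        return price <= budget_per_page
    return False             # blocked, errored, or otherwise unavailable

print(should_fetch(200, {}, 0.01))                          # True
print(should_fetch(402, {"crawler-price": "0.005"}, 0.01))  # True
print(should_fetch(402, {"crawler-price": "0.05"}, 0.01))   # False
```

Under a scheme like this, the tiered-pricing question above becomes concrete: a crawler with a fixed per-page budget would simply skip publishers whose quoted price exceeds it.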
Not all crawlers are AI bots; some enhance security, archive webpages, or index them for search engines. As such, there are crawlers that publishers may well want to allow onto their sites. Cloudflare claims to be able to accommodate these benevolent bots by allowing domain owners to selectively bypass payment for certain crawlers. For others, Cloudflare can selectively impose a punishment: one of its other features traps misbehaving bots in an "AI Labyrinth" that resembles a webpage but spins them around in circles.
Shayne Longpre, a PhD candidate at MIT, thinks that this sort of pushback against crawlers also threatens the transparency and open borders of the Web. Writing for the MIT Technology Review earlier this year, Longpre argued that raising a drawbridge to crawlers could shrink the internet's biodiversity. "Ultimately, the Web is being subdivided into territories where fewer crawlers are welcome," Longpre wrote. "For real users, this is making it harder to access news articles, see content from their favorite creators, and navigate the Web without hitting logins, subscription demands, and captchas each step of the way."