Social Climate Tech News

Fri 05 2024

Cloudflare Introduces 1-Click Solution to Block Web-Scraping AI Bots

by bernt & torsten

Cloudflare has introduced a feature that lets its web hosting customers block AI bots from scraping website content without permission. The move responds to customer dissatisfaction with AI bots that harvest content dishonestly, and is meant to keep the internet safe for content creators. It takes the form of a single one-click option that blocks all AI bots.

Traditionally, website owners have used a robots.txt file in a website’s root directory to tell automated crawlers which areas of the site to avoid. Compliance is voluntary, however: bots can simply ignore the file, much as websites can disregard the Do Not Track header sent by browsers, often without repercussions.

Concerns over content scraping for generative AI have been mounting, leading to lawsuits against AI companies accused of content theft. In response, some firms have allowed web publishers to opt out of having their content scraped. For instance, last August, OpenAI provided guidance on how to block its GPTBot crawler using robots.txt, and Google followed with similar measures the next month. That same September, Cloudflare introduced an option to block rule-abiding AI bots, which 85 percent of its customers activated. The new feature aims to fortify these defences by also covering bots that do not follow the rules.
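To make the opt-out mechanism concrete, here is a minimal Python sketch of how a rule-abiding crawler would interpret such a robots.txt file. GPTBot and Google-Extended are the crawler tokens published by OpenAI and Google; the file contents, the `is_allowed` helper, the `SomeOtherBot` name, and the example.com URL are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that opts out of two AI crawlers
# while leaving the site open to everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: this is the check a well-behaved crawler
    performs on itself before requesting a page.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    for agent in ("GPTBot", "Google-Extended", "SomeOtherBot"):
        print(agent, "allowed" if is_allowed(agent, page) else "blocked")
```

Nothing in this check is enforced by the server: a crawler that never runs it, or that ignores the result, can still request the page, which is the gap the new feature is meant to close.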

Cloudflare reports that AI bots accessed about 39 percent of the top one million web properties it serves, highlighting the pervasiveness of the issue. As recent investigations have shown, bots can and do ignore robots.txt directives. Amazon, for example, said it was looking into evidence that bots working on behalf of AI search firm Perplexity had crawled websites, including news sites, and reproduced their content without proper credit or permission, despite an expectation that these bots adhere to robots.txt. Perplexity’s CEO denied that the company was ignoring robots.txt but admitted that third-party bots it used were responsible for the unauthorized scraping.

Cloudflare's machine-learning model has consistently identified these bots, even when they disguise themselves as legitimate browsers using spoofed user agents. The company's global machine-learning system has flagged the Perplexity bot as likely automated based on digital fingerprinting techniques that detect patterns in network interactions.
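Cloudflare has not published the internals of its model, so the following Python sketch is only a toy illustration of the general idea behind fingerprinting: comparing what a client claims to be with how its request actually looks. The `spoof_score` function, the token list, and the header list are all hypothetical choices made for this example and do not reflect Cloudflare's actual signals.

```python
# Substrings that suggest a User-Agent is claiming to be a mainstream browser.
# Hypothetical list chosen for illustration.
BROWSER_TOKENS = ("Mozilla/", "Chrome/", "Safari/", "Firefox/")

# Headers that real browsers normally send with page requests.
# Hypothetical list chosen for illustration.
EXPECTED_BROWSER_HEADERS = ("accept", "accept-language", "accept-encoding")


def spoof_score(headers: dict[str, str]) -> float:
    """Return a score from 0.0 to 1.0 for a request's headers.

    The score is the fraction of expected browser headers that are
    missing from a request whose User-Agent claims to be a browser.
    Requests that do not claim to be a browser score 0.0, because
    this toy check only looks for inconsistent browser claims.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    claims_browser = any(token in user_agent for token in BROWSER_TOKENS)
    if not claims_browser:
        return 0.0
    missing = [h for h in EXPECTED_BROWSER_HEADERS if h not in normalized]
    return len(missing) / len(EXPECTED_BROWSER_HEADERS)


if __name__ == "__main__":
    suspicious = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0) Chrome/125.0"}
    print(spoof_score(suspicious))
```

A production system would combine many more signals, learned from traffic at scale rather than hand-written, but the principle is the same: a spoofed user agent changes one string, while the rest of the request still behaves like automation.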

With its network processing an average of 57 million requests per second, Cloudflare has ample data for distinguishing the digital fingerprints of trustworthy clients from those of bots. The new feature is available to all customers, including those on the free tier, and is enabled by toggling the "Block AI Scrapers and Crawlers" button in a website's security settings.

Cloudflare acknowledges that some AI companies may attempt to circumvent these new measures, but it remains committed to refining its detection models and enhancing protection for content creators. The goal is to ensure that content creators maintain full control over how their material is used in AI training and inference, preserving the integrity of the internet as a space for genuine content creation.