Announced on September 24, Cloudflare’s Content Signals Policy is an effort to use the company’s influential market position to change how content is used by web crawlers. It involves updating millions of websites’ robots.txt files.
Historically, robots.txt simply included a list of paths on the domain, each flagged as either “allow” or “disallow.” It has never been technically enforceable, but it became an effective honor system because it offers advantages to the owners of both the website and the crawler: website owners can dictate access for various business reasons, and crawlers avoid working through data that wouldn’t be relevant to them.
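For illustration, a traditional robots.txt reads something like the sketch below (the paths and bot name here are hypothetical):

```
# Classic robots.txt: path-level access rules only.
User-Agent: *
Allow: /blog/        # any crawler may fetch the blog
Disallow: /internal/ # all crawlers are asked to skip this section

User-Agent: ExampleBot
Disallow: /          # this one crawler is asked to stay out entirely
```

Note that nothing in the file says what a crawler may do with the pages it fetches; it only describes where the crawler is welcome to go.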
What robots.txt could never express, though, is how content may be used once it has been fetched. The Content Signals Policy is a newly proposed extension to the robots.txt format that intends to close that gap. It allows website operators to opt in or out of consenting to the following use cases, as worded in the policy (see the sketch after this list):
- search: Building a search index and providing search results (e.g., returning hyperlinks and short excerpts from your website’s contents). Search does not include providing AI-generated search summaries.
- ai-input: Inputting content into one or more AI models (e.g., retrieval augmented generation, grounding, or other real-time taking of content for generative AI search answers).
- ai-train: Training or fine-tuning AI models.
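In practice, the signals take the form of a new Content-Signal line that sits alongside the familiar directives. The following is a minimal sketch based on the examples published with the policy; the values shown are one hypothetical configuration, and Cloudflare’s managed version also prepends a longer comment block spelling out the terms:

```
# Content signals express consent for use, not just access.
Content-Signal: search=yes, ai-input=no, ai-train=no

User-Agent: *
Allow: /
```

Omitting a signal entirely expresses no preference either way.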
Cloudflare has given all of its customers quick paths for setting those values on a case-by-case basis. Further, it has automatically updated robots.txt on the 3.8 million domains that already use Cloudflare’s managed robots.txt feature, with search defaulting to yes, ai-train to no, and ai-input left unset to indicate a neutral position.
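Assuming the syntax sketched above, that default reduces to a single line, with ai-input omitted to convey the neutral position:

```
Content-Signal: search=yes, ai-train=no
```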