Earlier this week, the UK’s Competition and Markets Authority (CMA) opened its consultation on a set of proposed conduct requirements for Google. The consultation invites comments on the proposed requirements before the CMA imposes any final measures. These new rules aim to address the lack of choice and transparency that publishers (broadly defined as “any party that makes content available on the web”) face over how Google uses search to fuel its generative AI services and features. These are the first consultations on conduct requirements launched under the digital markets competition regime in the UK.
We welcome the CMA’s recognition that publishers need a fairer deal and believe the proposed rules are a step in the right direction. Publishers should be entitled to tools that allow them to control the inclusion of their content in generative AI services, and AI companies should have a level playing field on which to compete.
But we believe the CMA has not gone far enough, and can do more to safeguard the UK’s creative sector and foster healthy competition in the market for generative and agentic AI.
CMA designation of Google as having Strategic Market Status
In January 2025, the UK’s regulatory landscape underwent a significant legal shift with the implementation of the Digital Markets, Competition and Consumers Act 2024 (DMCC). Rather than relying on antitrust investigations to address risks to competition, the CMA can now designate firms as having Strategic Market Status (SMS) when they hold substantial, entrenched market power. This designation allows for targeted CMA interventions in digital markets, such as imposing detailed conduct requirements, to improve competition.
In October 2025, the CMA designated Google as having SMS in general search and search advertising, given its 90 percent share of the search market in the UK. Crucially, this designation encompasses AI Overviews and AI Mode, and the CMA now has the authority to impose conduct requirements on Google’s search ecosystem. Final requirements imposed by the CMA are not merely suggestions but legally enforceable rules that can relate specifically to AI crawling, with significant sanctions to ensure Google operates fairly.
Publishers need a meaningful way to opt out of Google’s use of their content for generative AI
The CMA’s designation could not be more timely. As we’ve said before, we are unquestionably at a moment when the Internet needs clear “rules of the road” for AI crawling behavior.
As the CMA rightly states, “publishers have no realistic choice but to allow their content to be crawled for Google’s general search because of the market power Google holds in general search. However, Google currently uses that content in both its search generative AI features and in its broader generative AI services.”
In other words: the same content that Google scrapes for search indexing is also used for inference/grounding purposes, like AI Overviews and AI Mode, which rely on fetching live information from the Internet in response to real-time user queries. And that creates an enormous problem for publishers, and for competition.
Because publishers cannot afford to disallow or block Googlebot, Google’s search crawler, on their websites, they have to accept that their content will be used in generative AI applications such as AI Overviews and AI Mode within Google Search that return little, if any, traffic to their websites. This undermines the ad-supported business models that have sustained digital publishing for decades, given the critical role Google Search plays in driving the human traffic that online advertising depends on. It also means that Google’s generative AI applications enter into direct competition with publishers by reproducing their content, most often without attribution or compensation.
Publishers’ reluctance to block Google because of its dominance in search gives Google an unfair competitive advantage in the market for generative and agentic AI. Unlike other AI bot operators, Google can use its search crawler to gather data for a variety of AI functions with little fear that its access will be restricted. It has minimal incentive to pay publishers for that data, which it is already getting for free.
This prevents the emergence of a well-functioning market where AI developers negotiate fair value for content. Instead, other AI companies are disincentivized from coming to the table, as they are structurally disadvantaged by a system that allows one dominant player to bypass compensation entirely. As the CMA itself acknowledges, “[b]y not providing sufficient control over how this content is used, Google can limit the ability of publishers to monetise their content, while accessing content for AI-generated results in a way that its competitors cannot match”.
Cloudflare data validates the concern about Google’s competitive advantage. Based on our data, Googlebot sees significantly more Internet content than its closest peers.
Over an observed period of two months, Googlebot successfully accessed almost twice as many individual pages as ClaudeBot and GPTBot, three times as many as Meta-ExternalAgent, and more than three times as many as Bingbot. The difference was even more extreme for other popular AI crawlers: for instance, Googlebot saw 167 times more unique pages than PerplexityBot. Of the sampled unique URLs that we observed on our network over the last two months, Googlebot crawled roughly 8%.
In rounded multiples, Googlebot sees:
~1.70x the number of unique URLs seen by ClaudeBot;
~1.76x the number of unique URLs seen by GPTBot;
~2.99x the number of unique URLs seen by Meta-ExternalAgent;
~3.26x the number of unique URLs seen by Bingbot;
~5.09x the number of unique URLs seen by Amazonbot;
~14.87x the number of unique URLs seen by Applebot;
~23.73x the number of unique URLs seen by Bytespider;
~166.98x the number of unique URLs seen by PerplexityBot;
~714.48x the number of unique URLs seen by CCBot; and
~1801.97x the number of unique URLs seen by archive.org_bot.
Googlebot also stands out in other Cloudflare datasets.
Even though it ranks as the most active bot by overall traffic, publishers are far less likely to disallow or block Googlebot in their robots.txt file compared to other crawlers. This is likely due to its importance in driving human traffic, and as a result ad revenue, to their content through search.
As shown below, almost no website explicitly disallows the dual-purpose Googlebot in full, reflecting how important this bot is to driving traffic via search referrals. (Note that partial disallows typically affect parts of a website that are irrelevant for search engine optimization, or SEO, such as login endpoints.)
Robots.txt merely allows the expression of crawling preferences; it is not an enforcement mechanism. Publishers rely on “good bots” to comply. To manage crawler access to their sites more effectively, independently of a given bot’s compliance, publishers can set up a Web Application Firewall (WAF) with specific rules, technically preventing undesired crawlers from accessing their sites. Following the same logic as with robots.txt above, we would expect websites to block mostly other AI crawlers, but not Googlebot.
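As an illustration, that expected pattern looks something like the robots.txt sketch below. The user-agent tokens are the crawler operators’ published names; the policy itself is only an example of the trade-off publishers face, not a recommendation, and compliance with it remains voluntary:

```
# Keep Googlebot: search referrals drive human traffic and ad revenue
User-agent: Googlebot
Allow: /

# Disallow standalone AI crawlers (two common examples)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

Because these directives are advisory only, it is a WAF rule matching the same user agents (or their verified IP ranges) that actually enforces the block.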
Indeed, when comparing the numbers for customers using AI Crawl Control, Cloudflare’s own AI crawler blocking tool that is integrated into our Application Security suite, between July 2025 and January 2026, one can see that the number of websites actively blocking other popular AI crawlers (e.g., GPTBot, ClaudeBot) was nearly seven times as high as the number of websites that blocked Googlebot and Bingbot. (Like Googlebot, Bingbot combines search and AI crawling and drives traffic to these sites, but given its small market share in search, its impact is less significant.)
So we agree with the CMA on the problem statement. But how can publishers be enabled to effectively opt out of Google using their content for its generative AI applications? We share the CMA’s conclusion that “in order to be able to make meaningful choices about how Google uses their Search Content, (…) publishers need the ability effectively to opt their Search Content out of both Google’s search generative AI features and Google’s broader generative AI services.”
But we are concerned that the CMA’s proposal is insufficient.
CMA’s proposed publisher conduct requirements
On January 28, 2026, the CMA published four sets of proposed conduct requirements for Google, including conduct requirements related to publishers. According to the CMA, the proposed publisher rules are designed to address concerns that publishers (1) lack sufficient choice over how Google uses their content in its AI-generated responses, (2) have limited transparency into Google’s use of that content, and (3) don’t get effective attribution for Google’s use of their content. The CMA acknowledged the significance of these concerns because of the role that Google search plays in finding content online.
The conduct requirements would mandate that Google grant publishers “meaningful and effective” control over whether their content is used for AI features, like AI Overviews. Google would be prohibited from taking any action that negatively affects the effectiveness of those control options, such as deliberately downranking the content in search.
To support informed decision-making, the CMA proposal also requires Google to increase transparency by publishing clear documentation on how it uses crawled content for generative AI and on exactly what its various publisher controls cover in practice. Finally, the proposal would require Google to ensure effective attribution of publisher content and to provide publishers with detailed, disaggregated engagement data, including specific metrics for impressions, clicks, and “click quality”, to help them evaluate the commercial value of allowing their content to be used in AI-generated search summaries.
The CMA’s proposed remedies are insufficient
Although we support the CMA’s efforts to improve options for publishers, we are concerned that the proposed requirements don’t solve the underlying challenge of promoting fair, transparent choice over how their content is used by Google. Publishers are effectively forced to use Google’s proprietary opt-out mechanisms, tied specifically to the Google platform and under conditions set by Google, rather than being granted direct, autonomous control. A framework where the platform dictates the rules, manages the technical controls, and defines the scope of application doesn’t offer “effective control” to content creators or encourage competitive innovation in the market. Instead, it reinforces a state of permanent dependency.
Such a framework also reduces choice for publishers. Creating new opt-out controls makes it impossible for publishers to choose external tools to block Googlebot from accessing their content without jeopardizing their appearance in Search results. Instead, under the current proposal, content creators will still need to allow Googlebot to scrape their websites, with no enforcement mechanisms to deploy and limited visibility available if Google doesn’t respect their signalled preferences. Enforcement of these requirements by the CMA, even if done properly, will be very hard, with no guarantee that publishers will trust the solution.
In fact, Cloudflare has received feedback from its customers that Google’s current proprietary opt-out mechanisms, including Google-Extended and ‘nosnippet’, have failed to prevent content from being used in ways that publishers cannot control. These opt-out tools also don’t enable mechanisms for fair compensation for publishers.
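For reference, the first of these controls is expressed in robots.txt roughly as follows. Note that Google-Extended is a control token only, not a separate crawler: Googlebot still fetches the pages either way, which is part of the problem described above (consult Google’s documentation for the exact scope of each control):

```
# robots.txt: Google-Extended opts content out of use for Google's AI models
User-agent: Google-Extended
Disallow: /
```

The ‘nosnippet’ control, by contrast, is a per-page directive, set via a robots meta tag (<meta name="robots" content="nosnippet">) or the X-Robots-Tag HTTP response header.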
More broadly, as reflected in our proposed responsible AI bot principles, we believe that all AI bots should have one distinct purpose and declare it, so that website owners can make clear choices over who can access their content and why. Unlike its main competitors, such as OpenAI and Anthropic, Google doesn’t comply with this principle for Googlebot, which is used for multiple purposes (search indexing, AI training, and inference/grounding). Simply requiring Google to develop a new opt-out mechanism wouldn’t allow publishers to gain meaningful control over the use of their content.
The best way to give publishers that needed control is to require Googlebot to be split into separate crawlers. That way, publishers could allow crawling for traditional search indexing, which they need to attract traffic to their sites, but block access for unwanted use of their content in generative AI services and features.
Requiring crawler separation is the only effective solution
To ensure a fair digital ecosystem, the CMA must instead empower content owners to prevent Google from accessing their data for particular purposes in the first place, rather than relying on Google-managed workarounds after the crawler has already accessed the content for other purposes. That approach also enables creators to set conditions for access to their content.
Although the CMA described crawler separation as an “equally effective intervention”, it ultimately rejected mandating separation based on Google’s input that it would be too burdensome. We disagree.
Requiring Google to split up Googlebot by purpose, as Google already does for its nearly 20 other crawlers, is not only technically feasible, but also a necessary and proportionate remedy that gives website operators the granular control they currently lack, without increasing crawler traffic load on their websites (and in fact, perhaps even decreasing it, should they choose to block AI crawling).
To be clear, a crawler separation remedy benefits AI companies by leveling the playing field between them and Google, in addition to giving UK-based publishers more control over their content. (There has been widespread public support for a crawler separation remedy from Daily Mail Group, the Guardian, and the News Media Association.) Mandatory crawler separation is not an impediment to Google, nor does it undermine investment in AI. On the contrary, it is a pro-competitive safeguard that prevents Google from leveraging its search monopoly to gain an unfair advantage in the AI market. By decoupling these functions, we ensure that AI development is driven by fair-market competition rather than the exploitation of a single hyperscaler’s dominance.
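If Googlebot were split by purpose, a publisher could express exactly that choice in a few lines of robots.txt. The token “Googlebot-GenAI” below is hypothetical, invented purely to illustrate the remedy:

```
# Allow crawling for traditional search indexing (drives referral traffic)
User-agent: Googlebot
Allow: /

# Block the (hypothetical) generative AI crawler
User-agent: Googlebot-GenAI
Disallow: /
```

The same per-purpose tokens would also let WAF rules and bot-management tools enforce that preference, rather than leaving it to the crawler’s goodwill.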
******
The UK has a unique chance to lead the world in protecting the value of original, high-quality content on the Internet. However, we worry that the current proposals fall short. We would encourage rules that ensure Google operates under the same conditions for content access as other AI developers, meaningfully restoring agency to publishers and paving the way for new business models promoting content monetization.
Cloudflare remains committed to engaging with the CMA and other partners across upcoming consultations, providing evidence-based data to help shape a final decision on conduct requirements that are targeted, proportionate, and effective. The CMA still has an opportunity to ensure that the Internet becomes a fair market for content creators and smaller AI players, not just a select few tech giants.



