New York
Monday, July 7, 2025


How AI companies are secretly collecting training data from the web (and why it matters)

Like most people, my wife types a search into Google many times every day. We work from home, so our family room doubles as a conference room. Whenever we're in a meeting and a question about something comes up, she Googles it.

That much is the same as it has been for years. But what happens next has changed.

Instead of clicking on one of the search result links, she more often than not reads the AI summary. These days, she rarely clicks through to any of the sites that provide the original information Google's AI summarizes.

When I spoke to her about this, Denise acknowledged that she does visit websites less frequently. But she also pointed out that, for topics where she's well-versed, she has noticed the AI is sometimes wrong. She said she takes the AI results with a grain of salt, but they generally provide enough basic information that she needs to look no further. If in doubt, she does dig deeper.

So that's where we are today. More and more users are like my wife, getting information from the AI and never visiting websites (and therefore never giving content creators a chance to be compensated for their work).

Worse, more and more people are trusting AI, so not only are they making it harder for content creators to make a living, but they're often getting hallucinated or incorrect information. Since they never visit the original sources, they have little impetus to cross-check or verify what they read.

The impact of AI scraping

Cloudflare CEO Matthew Prince offered some devastating statistics. He used the ratio of pages crawled to visitors referred as his metric.

As a baseline, he said that 10 years ago, for every two pages Google crawled, it sent one visitor to a content creator's site. Six months ago, that ratio was six pages crawled to one visitor sent to a content site. Now, just six months later, it's 18 pages crawled for every visitor sent to a content site.

The numbers, according to Prince, are far worse for AI companies, which derive enormous value from information they've scraped from the rest of us. Six months ago, the ratio of pages scraped to visitors referred by OpenAI was 250 to 1. Now, as people have become more accustomed to trusting the AI's answers (or too lazy to care about inaccuracies), the ratio is 1,500 to 1.

In many ways, AI is becoming an existential threat to content creators. By vacuuming up content produced by hard-working teams all over the world, and then feeding that content back to readers as summaries, the AI companies are costing publishers and writers revenue and influence. Many creators are also losing motivation, because if they can't make a living doing it, or at least build a following, why bother?

Some publishers, like Ziff Davis (ZDNET's parent company) and the New York Times, are suing OpenAI for copyright infringement. You've probably seen the disclaimer on ZDNET that says, "Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems."


Other publishers, including the Wall Street Journal, the Financial Times, the Atlantic, and the Washington Post, have licensed their content to OpenAI and other AI large language model makers.

The damage AI intermediation could do to society as a whole is profound and worth an article of its own. But this article is more practical. Here, we acknowledge the threat AI presents to publishing and focus on technical ways to fight back.

In other words, if the AIs can't scrape, they can't give away published and copyrighted content without publishers' permission.

Robots.txt: Your first defense

The simplest, most direct, and possibly least effective defense is the robots.txt file. It's a file you place at the root of your website's directory. It tells spiders, crawlers, and bots whether they have permission to access your site. This is also known as user-agent filtering.

This file has a number of interesting implications. First, only well-behaved crawlers will pay attention to its specifications. It doesn't provide any protection against access, so compliance is entirely voluntary on the part of the bots.

Second, you need to be careful which bots you send away. For example, if you use robots.txt to deny access to Googlebot, your site won't get indexed for search on Google. Say goodbye to all Google referrals. On the other hand, if you use robots.txt to deny access to Google-Extended, you'll block Gemini from indexing and using your site for Gemini training.

This site has an index of the bots you might want to deny access to. And here is OpenAI's guide on how to prevent OpenAI's bots from crawling your site.
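To make the distinction concrete, here is a sketch of a robots.txt that keeps ordinary search indexing while turning away AI-training crawlers. The user-agent names below are the ones the respective vendors document, but verify them against each vendor's current documentation before deploying:

```text
# Allow ordinary search indexing
User-agent: Googlebot
Allow: /

# Block Google's AI-training crawler (does not affect Google Search)
User-agent: Google-Extended
Disallow: /

# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block Common Crawl, a frequent source of AI training data
User-agent: CCBot
Disallow: /
```

Again, this is advisory only: a crawler that chooses to ignore robots.txt will never read this file in the first place.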

But what about web scrapers that ignore robots.txt? How do you prevent them from scraping your site?

How can you prevent rogue scraping?

It's here that site operators need to use a belt-and-suspenders strategy. You're essentially in an arms race: you're trying to find ways to defend against scraping, while the scrapers are trying to find ways to suck down all your site's data. In this section, I'll list a few techniques. It's far from a complete list, and the techniques change constantly, on the part of both the defenders and the scrapers.

Rate limit requests: Configure your server to limit how many pages a given IP address can request in a period of time. Humans aren't likely to request hundreds of pages per minute. This, like most of the techniques itemized in this section, differs from server to server, so you'll need to look up how to configure the capability on yours. It may also annoy your site's visitors so much that they stop visiting. So, there's that.
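A minimal sketch of the idea, assuming a per-IP sliding window (production servers do this in modules like nginx's limit_req; the limits and names below are made up for illustration):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # size of the sliding window
MAX_REQUESTS = 100    # humans rarely exceed this many pages per minute

_hits = defaultdict(deque)  # IP address -> timestamps of recent requests

def allow_request(ip, now=None):
    """Return True if this IP is under the rate limit, False otherwise."""
    now = time.monotonic() if now is None else now
    window = _hits[ip]
    # Drop timestamps that have aged out of the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False  # over the limit: serve a 429 or drop the connection
    window.append(now)
    return True
```

A request that returns False would typically get an HTTP 429 response rather than a silent drop, so legitimate users who trip the limit understand what happened.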

Use CAPTCHAs: Keep in mind that CAPTCHAs tend to inconvenience users, but they can reduce some kinds of crawler access to your site. Of course, the irony is that if you're trying to block AI crawlers, it's the AIs that are most likely to be able to defeat the CAPTCHAs. So there's that.


Selective IP bans: If you find there are IP ranges that overwhelm your site with access requests, you can ban them at the firewall level. FireHOL (an open-source firewall toolset) maintains blacklists of IP addresses. Most of them are cybersecurity-related, but they can get you started on a block list. Be careful, though. Don't use blanket IP bans, or legitimate visitors will be locked out of your site. So, there's that, too.
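At the application level, a selective ban boils down to checking whether a client address falls inside one of your banned ranges. A sketch using Python's standard ipaddress module; the CIDR ranges shown are reserved documentation ranges, not a real block list:

```python
import ipaddress

# Placeholder ranges; in practice you'd load these from a maintained
# source such as the FireHOL blocklists.
BANNED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),   # TEST-NET-3, for illustration
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2, for illustration
]

def is_banned(client_ip):
    """Return True if client_ip falls inside any banned CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BANNED_NETWORKS)
```

The same check done at the firewall (iptables/nftables with ipset, for instance) is cheaper because banned traffic never reaches your web server at all.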

The rise of anti-scraping services

There is a growing number of anti-scraping services that will attempt to defend your site for a fee. They include:

  • QRATOR: Network-layer filtering and DDoS-aware bot blocking
  • Cloudflare: Reputation tracking, fingerprinting, and behavioral analysis
  • Akamai Bot Manager: Identity, intent, and behavioral modeling
  • DataDome: Machine learning plus real-time response
  • HUMAN Security: JavaScript sensors with AI backend
  • Kasada: Adaptive challenges and so-called tamper-proof JavaScript telemetry
  • Imperva: Threat intelligence plus browser fingerprinting
  • Fastly: Rule-based filtering with edge logic
  • Fingerprint: Cross-session fingerprinting and user tracking
  • Link11: Behavioral analysis and traffic sandboxing
  • Netacea: Intent-based detection and server-side analytics

Here's a quick overview of some of the techniques these services use.

Behavior matching: This technique analyzes more than headers; it analyzes request behavior. It's essentially a combination of header analysis and bot-by-bot request limiting.

JavaScript challenges: Beyond JavaScript-based CAPTCHAs, these often run in the background of a web page. They require scripts to execute, or measure the pacing of interaction on the page, before allowing further access.

Honeypot traps: These are often elements buried in a web page, like invisible fields or links, that are designed to catch bots. If a bot grabs everything on a site (which a human user is unlikely to do), the honeypot trap recognizes it and initiates a server block.
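The honeypot idea can be sketched in a few lines: hide a link humans never see, and flag any client that fetches it. The path, markup, and handler below are hypothetical, simplified stand-ins for what a real service would do:

```python
# Hypothetical server-side honeypot check. The page would contain a link
# that is invisible to humans but present in the HTML, e.g.:
#   <a href="/trap-page" style="display:none" rel="nofollow">ignore</a>
# Only a crawler following every link will ever request that path.

HONEYPOT_PATHS = {"/trap-page"}
_blocked_ips = set()

def handle_request(ip, path):
    """Return an HTTP status: 403 for caught or previously caught bots, else 200."""
    if ip in _blocked_ips:
        return 403
    if path in HONEYPOT_PATHS:
        _blocked_ips.add(ip)  # caught: block this IP from now on
        return 403
    return 200
```

The rel="nofollow" on the trap link matters: it keeps well-behaved search engine crawlers from following it and getting themselves banned.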

Overall behavioral analysis: This is where AIs are fighting AIs. AIs running on behalf of your website monitor access behavior and use machine learning to identify access patterns that aren't human. Those malicious accesses can then be blocked.

Browser fingerprinting: Browsers provide a wide range of data about themselves to the sites they access. Bots often attempt to spoof the fingerprints of legitimate users, but they frequently reveal telltale fingerprints of their own in the process, which blocking services can aggregate and then use to block the bots.
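A crude version of the fingerprinting idea is to hash the stable parts of each request's headers and then count how many "distinct" clients share one fingerprint. This sketch is deliberately simplified; commercial services also fold in TLS characteristics, JavaScript-gathered measurements, and much more:

```python
import hashlib

# Simplified fingerprint built only from request headers.
FINGERPRINT_HEADERS = ("User-Agent", "Accept-Language", "Accept-Encoding")

def fingerprint(headers):
    """Hash the stable header fields into a short fingerprint string."""
    raw = "|".join(headers.get(h, "") for h in FINGERPRINT_HEADERS)
    return hashlib.sha256(raw.encode()).hexdigest()[:16]
```

If one fingerprint shows up across thousands of IP addresses at machine speed, that cluster is almost certainly a distributed scraper rather than thousands of coincidentally identical humans.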

Decoy traps: These are mazes of decoy pages full of autogenerated, useless content, linked together in a pattern that causes bots to waste their time or get stuck following links. Most of these are tagged with "nofollow" links, so search engines don't index them or penalize your SEO rank. Of course, malicious bots are learning to identify these traps and counter them, but they do offer limited protection.

The big trade-off of blocking scraping for AI training

As an author who makes my living directly from my creative output, I find the prospect of AIs using my work as training data offensive. How dare a company like OpenAI make billions off the backs of all of us creatives! They then turn around and offer a product that could potentially put many of us out of work.

And yet, I have to acknowledge that AI has saved me time in many different ways. I use a text editor or a word processor every day. But back when I started my career, the publications I wrote for had typesetting operators who converted my written words into publishable content. Now, blogging tools and content management systems do that work. An entire profession vanished in the space of a few years. Such is the price of new technology.


I've been involved with AI innovation for decades. After writing about generative AI since it boomed in early 2023, I'm convinced it's here to stay.

AI chatbots like Google Gemini and ChatGPT are making token efforts to be good citizens. They scrape all our content and make billions off of it, but they're willing to provide links back to our work for the very few who bother to check sources.

Some of the big AI companies contend that they provide value back to publishers. An OpenAI spokesperson told the Columbia Journalism Review, "We support publishers and creators by helping 400M weekly ChatGPT users discover quality content through summaries, quotes, clear links, and attribution."

Quoted in Digiday, David Carr, senior insights manager at data analytics company Similarweb, said, "ChatGPT sent 243.8 million visits to 250 news and media websites in April 2025, up 98% from 123.2 million visits this January."

Those numbers are big, but only without context. Google gets billions of visits a day, and before AI, nearly all of those visits resulted in referrals out to other sites. With Google's referral percentages dropping precipitously and OpenAI's referrals amounting to a very small share of the traffic otherwise sent to content producers, the problem is very real.

Yes, those links are mere table scraps, but should we block them? If you enable web scraping blocks on your website, will it do anything other than "cut off your nose to spite your face," as my mother used to say?

Unless every site blocks AI scrapers, effectively locking AI data sets to 2025 and earlier, blocking your own site from the AIs will accomplish little more than preventing you from getting what little traffic there is from the AI services. So should you?

In the long run, this practice of AI scraping is unsustainable. If AIs prevent creatives from deriving value from their hard work, the creatives will no longer have an incentive to keep creating. At that point, the quality of AI-generated content will begin to decline. It will become a vicious circle, with fewer creatives able to monetize their skills and the AIs providing ever-worsening content quality.

So, what can we do about it? If we're to survive into the future, our entire industry needs to ask, and try to answer, that question. If not, welcome to Idiocracy.

What about you? Have you taken any steps to block AI bots from scraping your site? Are you concerned about how your content might be used to train generative models? Do you think the trade-off between visibility and protection is worth it? What kinds of tools or services, if any, are you using to monitor or limit scraping? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.
