
Unwanted AI scraping of publisher websites is placing a costly financial burden on publishers, according to one leading industry executive.
Chris Dicker, chief executive of Candr Media Group and board member of the Independent Publishers Alliance, said that the publisher’s Trusted Reviews website was taken down multiple times on 16 August when it was scraped 1.6 million times in a day.
This was up from a previous record of 1.2 million scrapes on the site a day earlier.
He said the average level of AI scraping for Trusted Reviews is running at between approximately 70,000 and 100,000 times a day.
Dicker said he believed the 1.6 million AI scrapes performed on 16 August resulted in just 603 actual users of generative AI platforms later landing on the Trusted Reviews site, meaning a clickthrough rate from ChatGPT and other similar platforms of 0.037%. He noted in a LinkedIn post that this was “dramatically lower than you would expect from traditional search”.
He told Press Gazette that over the previous week the ratio of AI bot scrapes to resulting human visits was 1,888 to one.
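As a quick sanity check on those figures, the arithmetic below uses only the numbers reported above; the script itself is purely illustrative:

```python
# Back-of-envelope check of the clickthrough figures from the article.
scrapes = 1_600_000   # AI scrapes of Trusted Reviews on 16 August
human_visits = 603    # generative-AI users who later landed on the site

clickthrough_pct = human_visits / scrapes * 100
print(f"Clickthrough rate: {clickthrough_pct:.4f}%")  # prints 0.0377%, the ~0.037% cited

scrapes_per_visit = scrapes / human_visits
print(f"About one human visit per {scrapes_per_visit:,.0f} scrapes")
```

Note the single-day ratio for 16 August (roughly 2,650 scrapes per visit) is even steeper than the 1,888-to-one figure Dicker reported for the previous week.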
He also said engagement from that traffic “performed poorly” with 58% less time spent on the site than an average user and 10% fewer pages viewed.
Dicker said Candr is currently preparing to change hosting provider and so has not yet taken some steps it plans to take, such as implementing Tollbit’s bot paywall, which lets sites set fees for AI crawlers to access their content.
But Dicker told Press Gazette the data raised concerns about some AI companies not respecting robots.txt signals that a website does not want to be crawled by specified bots.
He said Tollbit’s dashboard for Trusted Reviews showed that OpenAI had ignored robots.txt to scrape the site 12.2 million times in the past three months, with Meta doing so 2.8 million times, Amazon 2.4 million times, Perplexity 101,000 and Bytedance 95,000.
Trusted Reviews saw a 75% increase in AI bots ignoring robots.txt compared with the previous three months, he added, while over the past six months the figure was up 89% on the prior six months.
Website hosting company Cloudflare announced in July that it would allow websites to block all scrapers by default.
In response to Dicker’s LinkedIn post, OpenAI’s head of media partnerships Varun Shetty said: “Took a look at your robots.txt file and you are allowing OAI-Searchbot access to your site. If that’s something you’d prefer not to do, you can block the crawler and you should see this issue resolved. We’re working closely with publishers on finding ways that our products can drive them value – will continue to do that.”
However, Dicker told Press Gazette that it was not OAI-Searchbot that was driving the spike in scraping – it was another OpenAI bot, ChatGPT-User, which is blocked by Trusted Reviews.
This bot has been the most prominent of the scrapers on Trusted Reviews, Dicker said, although there are occasional spikes from other tech companies – such as Apple, around the time it said it planned to create an LLM, and Amazon, ahead of sales periods. Bytedance and Meta are also present.
Dicker also noted that with the onus on publishers to block bots via robots.txt, they need to know what the bots are in order to do so. This means that any new ones that spring up are harder to block in advance.
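To illustrate what that blocking looks like in practice, a robots.txt along the lines below would disallow the crawlers named in this article. The user-agent tokens shown are the ones each vendor publishes for its bots, but as the Tollbit data above suggests, robots.txt is purely advisory and some crawlers scrape regardless:

```
# Illustrative robots.txt – blocks AI crawlers by their published user-agent tokens
User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /
```

This also makes Dicker’s point concrete: each bot must be denied by name, so a crawler with a new or undisclosed token passes through until the list is updated.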
‘Actual hard cost’ to publishers of AI scraping is hard to pin down
Dicker also raised concerns around the cost of huge spikes in scraping to publishers, especially smaller ones.
“From an Independent Publishers Alliance perspective, some of our members have come to me and said, Chris, we’ve got an issue where we are getting scraped so much that our hosting providers are saying we now need to move up a package and that is costing thousands of pounds a year,” he said.
Dicker added that the impact is wider than just the hosting fees: “What’s the impact on the users that were trying to come to our site when the site went down? What brand impact is happening?… [and] the fact that we won’t be able to serve adverts to them at that particular time. The actual hard cost is almost impossible to actually put a number on. How do you put a number on damage to your brand?”
Dicker said Trusted Reviews has been around for 20 years and spent millions on marketing “to put it in the position it’s in. What’s the impact that all of a sudden users are coming to the site and not able to access it, or come on and have a terrible experience because of the bot activity? A fraction of the millions of pounds we spent? Add it to your bill.”
AI scraping ‘not matched by meaningful value exchange’
Stuart Forrest, global SEO director for publishing at Bauer Media, told Press Gazette that LLM bots (excluding Google) are making up an average of around 20% of total scraping on Bauer sites and at other publishers according to data he has seen.
He noted that a new tracker from software company Ahrefs shows that in July, 42% of referral traffic to 44,000 websites came from Google (he pointed out that the sample is not publisher-only, and Google’s share could be even higher among publishers alone). Meanwhile ChatGPT, the biggest LLM by referral traffic sent, referred only 0.2% of all traffic.
“On one hand, you’ve got this growth in cost of scraping, but it’s just not matched by any kind of meaningful value exchange… for me, that’s the fundamental – that there is not a value exchange for publishers.”
Forrest added: “What do we do with that information when we’re deciding whether or not to block? Right now, just using that data, you’d say, well, we’re going to block, right, because they’re stealing our IP,” adding that publishers can know “with some confidence” that when a scrape is happening it’s to answer a real user query based on their content – but without getting the referral traffic.
He said the argument often made against blocking is that LLMs are leading to changing search behaviour, with new platforms like ChatGPT growing fast, meaning that it will be advantageous in the long run to be present on them. But he pointed to Ahrefs data showing that just 0.2% of traffic to websites is coming from ChatGPT.
“We’ve just not seen that exponential growth in referral traffic,” he said. “Certainly, if you use referral as a business model, it’s not going to be worth it… it doesn’t feel like there’s ever going to be a similar to Google business model in which we let them have our content free, we benefit from referral traffic.”
Forrest said he was not sure if high levels of scraping would be causing a huge cost burden to many publishers currently, but noted that if CDN [content delivery network] providers are seeing an impact on their margins, they’ll likely pass on those costs to publishers as contracts come up for renewal.
The post AI bots bombard publisher websites with ‘no meaningful value exchange’ appeared first on Press Gazette.