Anthropic is scraping websites so fast it's causing problems

50 points by michaelhoffman 10 months ago

11 comments

jsheard 10 months ago
At least their bots accurately identify themselves in the User-Agent field even when they're ignoring robots.txt, so server-side blocking is on the table for now at least.

ByteDance's crawler (Bytespider) is another one that disregards robots.txt but still identifies itself, and you probably should block it because it's *very* aggressive.

It's going to get annoying fast when they inevitably go full blackhat and start masquerading as normal browser traffic.
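That kind of server-side blocking is easy to sketch. Below is a minimal Python WSGI middleware illustrating the idea; the User-Agent tokens are assumptions based on how these crawlers have publicly identified themselves, so verify the current names against each vendor's documentation before relying on them.

    # Minimal sketch: refuse requests whose User-Agent matches a known crawler.
    # The tokens are illustrative assumptions, not an authoritative list.
    BLOCKED_UA_TOKENS = ("ClaudeBot", "anthropic-ai", "Bytespider")

    def block_crawlers(app):
        """Wrap a WSGI app; return 403 Forbidden for listed crawlers."""
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(token.lower() in ua.lower() for token in BLOCKED_UA_TOKENS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Crawler blocked.\n"]
            return app(environ, start_response)
        return middleware

This only works as long as the bots keep identifying themselves, which is exactly the caveat raised above.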
ericholscher 10 months ago
For those saying "just use a CDN", it's not nearly that simple. Even behind a CDN, the crawlers on our site are hitting large files that aren't frequently accessed. This leads to high cache-miss rates:

https://fosstodon.org/@readthedocs/112877477202118215
l1n 10 months ago
> Sites use robots.txt to tell well-behaved web crawlers what data is up for grabs and what data is off limits. Anthropic ignores it and takes your data anyway. That's even if you've updated your robots.txt with the latest configuration details for Anthropic. [404 Media]

This doesn't seem supported by the citation: https://www.404media.co/websites-are-blocking-the-wrong-ai-scrapers-because-ai-companies-keep-making-new-ones/
JohnFen 10 months ago
It didn't take long for the "responsible" Anthropic to show its true colors.
ChrisArchitect 10 months ago
[dupe]

Some more discussion: https://news.ycombinator.com/item?id=41060559
lolpanda 10 months ago
Cloudflare has a switch to block all unknown bots other than well-behaved ones. Would that be a simple solution for most sites? I wonder whether the main concern here is that sites don't want to waste bandwidth/compute on AI bots, or that they don't want their content used for training.
jakubsuchy 10 months ago
Just like Cloudflare, many providers now allow blocking: https://www.haproxy.com/blog/how-to-reliably-block-ai-crawlers-using-haproxy-enterprise

(disclaimer: I wrote this blog post)
superkuh 10 months ago
I've noticed Anthropic bots in my logs for more than a year now and I welcome them. I'd love for their LLM to be better at what I'm interested in. I run my website off my home connection on a desktop computer and I've never had a problem. I'm not saying my dozens of run-ins with the Anthropic bots (there have been 3 variations I've seen so far) are totally representative, but they've been respecting my robots.txt.

They even respect extended robots.txt features like:

    User-agent: *
    Disallow: /library/*.pdf$

I make my websites for other people to see. They are not secrets I hoard whose value goes away when copied. The more copies and derivations the better.

I guess ideas like creative commons and sharing go away when the smell of money enters the water. Better lock all your text behind paywalls so the evil corporations won't get it. Just be aware: for every incorporated entity you block, you're blocking just as many humans with false positives, if not more. This anti-"scraping" hysteria is mostly profit motivated.
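The `*` and `$` in that rule are extensions beyond the original 1994 robots.txt draft, which only prefix-matched paths (Python's stdlib urllib.robotparser, for example, predates these extensions and may not honor them). A minimal sketch of how a wildcard-aware crawler could interpret such a rule, checked against hypothetical example paths:

    import re

    def robots_rule_matches(rule: str, path: str) -> bool:
        """True if a Disallow rule with * and $ wildcards matches a URL path.

        Matching is anchored at the start of the path; '*' matches any run
        of characters and a trailing '$' anchors the end of the path, per
        the extended syntax later standardized in RFC 9309.
        """
        pattern = re.escape(rule).replace(r"\*", ".*")
        if pattern.endswith(r"\$"):
            pattern = pattern[:-2] + "$"  # trailing $ becomes an end anchor
        return re.match(pattern, path) is not None

    # Hypothetical paths checked against the rule quoted above:
    assert robots_rule_matches("/library/*.pdf$", "/library/paper.pdf")
    assert not robots_rule_matches("/library/*.pdf$", "/library/paper.pdf.html")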
zorrn 10 months ago
I don't know if I should block Claude. I think it's really good and I use it regularly, and I don't think it's fair to say that others should provide the content.
dzonga 10 months ago
What happens when AI scrapers no longer have info to scrape?

Funny thing: with WASM, the web won't be scrapable.
iLoveOncall 10 months ago
One million hits in 24 hours is only about 11 TPS (1,000,000 / 86,400 seconds ≈ 11.6). If that's causing issues, then Anthropic isn't the problem; your application or hosting is.