AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt

77 points by hilux, 4 months ago

14 comments

dangle1, 4 months ago
Related to previous: https://news.ycombinator.com/item?id=42725147
BizarreByte, 4 months ago
AI haters? I don't hate AI, I just don't want things I've created being used to enrich multi-billion-dollar companies for free. These companies are behaving poorly and they should expect this kind of pushback.
marginalia_nu, 4 months ago
To be fair, there are tens of thousands of content farms filling the web with AI slop. That's far more likely to harm AI scrapers than these hijinks.

Most crawlers use some form of timeout mechanism, usually informed by some priority scheduling. This deals reasonably well with crawler traps.

Since Nepenthes-like traps are getting so common now (and in particular, not always behind robots.txt), I added a clause to Marginalia's crawler that prevents it from extracting links from pages that are less than 2 Kb and take more than 9 seconds to load. It's 4 lines of code and means the crawler doesn't get stuck at all.

I totally get the frustration though. My sites get an insane amount of bot traffic as well. I think roughly 1% of the search traffic to the HTML endpoint is human, and that's while providing a free API they could use instead. ... I just don't think this is going to fix anything.
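
A minimal sketch of that kind of guard, for illustration only (the function name and structure are hypothetical, not Marginalia's actual code; the 2 Kb and 9 second thresholds are the ones stated above):

    # Hypothetical guard: skip link extraction when a response is both tiny and
    # slow to arrive, the drip-feed signature of a tarpit.
    def should_extract_links(body: bytes, fetch_seconds: float) -> bool:
        """Return False when the page looks like a tarpit (< 2 Kb and > 9 s)."""
        is_tiny = len(body) < 2 * 1024       # less than 2 Kb of content
        is_slow = fetch_seconds > 9.0        # took more than 9 seconds to load
        return not (is_tiny and is_slow)

A crawler would call this after each fetch and only queue a page's outgoing links when it returns True.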
uberman4 个月前
Article seems super biased to me. Why are tarpits repeatedly characterized as attacks rather than as defense?
nickphx, 4 months ago
I block them because they're not paying me to use my resources. They would block me if I made a similar volume of requests.
Dwedit, 4 months ago
There's also the other kind of AI haters who do not give any anti-bot indicators about their tarpits (no robots.txt entry, no "nofollow", etc.), and want to intentionally feed them poisoned data.
freitasm, 4 months ago
> AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt

> That's likely an appealing bonus feature for any site owners who, like Aaron, are fed up with paying for AI scraping and just want to watch AI burn.

Website owners aren't "haters" if bots ignore robots.txt, consuming resources that translate to expenses and a bad experience for legitimate website visitors.

Why would the website owner have to commission much larger server(s), pay more for this traffic and get nothing in return? At least search engines send human visitors your way.

It's not "AI haters". It's exploitation hating.
anotherhue, 4 months ago
I wonder if we could run the cheapest, smallest 1b model (or smaller) LLM to generate this data in a way that's just plausible enough to be ingested.

A little note in robots.txt offering commercial terms would also be available.
rglover, 4 months ago
I've yet to do it, but my pending defense strategy was to download plain-text copies of erotica novels and, when I detect a chat bot crawler, just redirect it to that folder and let it go... hard.

Calling people who don't want certain things hoovered up by an LLM "AI haters" is a level of manipulation I'd think was only reserved for someone with a vested interest in the tech. It just encourages devious behavior instead of a more diplomatic approach of respecting people's wishes (read: robots.txt).
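
A rough sketch of that kind of diversion, purely as an illustration (Flask is used here as an example framework, and the user-agent list and decoy path are hypothetical; real crawler detection usually also looks at IP ranges and request behavior, not just the User-Agent header):

    # Hypothetical middleware: send suspected AI crawlers to a decoy directory
    # instead of the real content. The bot names below are examples only.
    from flask import Flask, redirect, request

    app = Flask(__name__)
    AI_CRAWLER_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

    @app.before_request
    def divert_ai_crawlers():
        ua = request.headers.get("User-Agent", "")
        if any(bot in ua for bot in AI_CRAWLER_AGENTS):
            return redirect("/decoy/", code=302)  # hand them the decoy corpus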
bsnnkv, 4 months ago
I run a service[1] that was getting hit pretty hard by these crawlers.

Ultimately, instead of going down this path, I decided to just start charging for access to the service (it was long overdue)[2].

Users who are logged out can still see old cached content (which is a single DB read op), but aggregating new content requires an account. I feel like this is a good (enough) middle-ground solution for now.

[1]: https://kulli.sh

[2]: https://lgug2z.com/articles/in-the-age-of-ai-crawlers-i-have-chosen-to-paywall/
amelius, 4 months ago
What if the scrapers use breadth-first search?
dinobones, 4 months ago
ok...

    if depth > 5 and sem_hash(content) in hist: return
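
Spelled out, that one-liner is roughly the following (an illustrative sketch: sem_hash here is a crude stand-in, a hash of whitespace-normalized text, for a real near-duplicate hash such as SimHash, and hist is the set of hashes already seen on the crawl branch):

    import hashlib

    def sem_hash(content: str) -> str:
        # Normalize case and whitespace so trivially reshuffled tarpit pages collide.
        normalized = " ".join(content.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def keep_crawling(content: str, depth: int, hist: set[str]) -> bool:
        """Return False to abandon this branch, per the heuristic above."""
        digest = sem_hash(content)
        if depth > 5 and digest in hist:
            return False
        hist.add(digest)
        return True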
andrewfromx, 4 months ago
"trap crawlers in infinite mazes of gibberish data, potentially increasing AI training costs and poisoning datasets. While their effectiveness is debated, creators see them as a form of resistance against unchecked AI development."
andrewmutz, 4 months ago
Regardless of your views on AI, LLMs are going to be influential in the future. If you work to keep your content away from models, it's hard to see how you benefit.

25 years ago, if you had blocked the googlebot scraper because you resented Google search, it would only have worked to marginalize the information you were offering up on the internet. Avoiding LLM training datasets will lead to similar outcomes.