Nepenthes is a tarpit to catch AI web crawlers

714 points | by blendergeek | 4 months ago

50 comments

bflesch 4 months ago

Haha, this would be an amazing way to test the ChatGPT crawler reflective DDOS vulnerability [1] I published last week.

Basically, a single HTTP request to the ChatGPT API can trigger 5000 HTTP requests by the ChatGPT crawler to a website.

The vulnerability is/was thoroughly ignored by OpenAI/Microsoft/BugCrowd, but I really wonder what would happen when the ChatGPT crawler interacts with this tarpit several times per second. As the ChatGPT crawler is using various Azure IP ranges, I actually think the tarpit would crash first.

The vulnerability reporting experience with OpenAI / BugCrowd was really horrific. It's always difficult to get attention for DOS/DDOS vulnerabilities, and companies always act like they are not a problem. But if their system goes dark and the CEO calls, then suddenly they accept it as a security vulnerability.

I spent a week trying to reach OpenAI/Microsoft to get this fixed, but I gave up and just published the writeup.

I don't recommend exploiting this vulnerability, for legal reasons.

[1] https://github.com/bf/security-advisories/blob/main/2025-01-ChatGPT-Crawler-Reflective-DDOS-Vulnerability.md
m3047 4 months ago

Having first run a bot motel in, I think, 2005, I'm thrilled and greatly entertained to see this taking off. When I first did it, I had crawlers lost in it literally for days, and you could tell that eventually some human would come back and try to suss the wreckage. After about a year I started seeing URLs like ../this-page-does-not-exist-hahaha.html. Sure it's an arms race, but just like security is generally an afterthought these days, don't think that you can't be the woodpecker which destroys civilization. The comments are great too; this one in particular reflects my personal sentiments:

> the moment it becomes the basic default install (ala adblocker in browsers for people), it does not matter what the bigger players want to do
taikahessu 4 months ago

We had our non-profit website drained of bandwidth and the site temporarily closed (!!) under our hosting deal because of the Amazon bot aggressively crawling URLs like ?page=21454 ... etc.

Gladly SiteGround restored our site without any repercussions, as it was not our fault. We added the Amazon bot to robots.txt after that one.

I don't like how things are right now. Is a tarpit the solution? Or better laws? Would they stop the Chinese bots? Should they even? I don't know.
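For reference, a minimal robots.txt of the kind described here might look like the snippet below. "Amazonbot" is the user-agent token Amazon documents for its crawler; whether a given crawler actually honors robots.txt is, as this thread shows, a separate question.

```
User-agent: Amazonbot
Disallow: /
```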
Havoc 4 months ago

What blows my mind is that this is functionally a solved problem.

The big search crawlers have been around for years & manage to mostly avoid nuking sites into oblivion. Then the AI gang shows up - supposedly the smartest guys around - and suddenly we're re-inventing the wheel on crawling and causing carnage in the process.
dspillett 4 months ago

Tarpits to slow down the crawling may stop them crawling your entire site, but they'll not care unless a great many sites do this. Your site will be assigned a thread or two at most, and the rest of the crawling machine's resources will be off scanning other sites. There will be timeouts to stop a particular site even keeping a couple of cheap threads busy for long. And anything like this may get you delisted from search results you might want to be in, as it can be difficult to reliably distinguish these bots from others and sometimes even from real users; and if things like this get good enough to be any hassle to the crawlers, they'll just start lying (more) and be even harder to detect.

People scraping for nefarious reasons have had decades of other people trying to stop them, so mitigation techniques are well known unless you can come up with something truly unique.

I don't think random Markov-chain-based text generators are going to pose much of a problem to LLM training scrapers either. They'll have rate limits and vast attention spreading too. Also, I suspect that random pollution isn't going to have as much effect as people think because of the way the inputs are tokenised. It will have an effect, but this will be massively dulled by the randomness – statistically relatively unique information and common (non-random) combinations will still bubble up obviously in the process.

I think better would be to have less random pollution: use a small set of common text to pollute the model. Something like "this was a common problem with Napoleonic genetic analysis due to the pre-frontal nature of the ongoing stream process, as is well documented in the grimoire of saint Churchill the III, 4th edition, 1969"; in fact these snippets could be Markov-generated, but use the same few repeatedly. They would need to be nonsensical enough to be obvious noise to a human reader, or highlighted in some way that the scraper won't pick up on but a general intelligence like most humans would (perhaps a CSS-styled side-note inlined in the main text, though that would likely have accessibility issues), and you would need to cycle them out regularly or scrapers will get "smart" and easily filter them out; but them appearing fully, numerous times, might mean they have a more significant effect on the tokenising process than more entirely random text.
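A minimal sketch of the "small pool of repeated pollution snippets" idea described above. Everything here, including the snippet text and the rotation scheme, is invented for illustration; the point is just that a few fixed nonsense passages get spliced deterministically into many pages and rotated over time.

```python
import hashlib
import random

# A small, fixed pool of nonsense snippets, keyed by rotation epoch.
# Repeating a few of them widely (rather than emitting fully random text)
# is the commenter's suggestion for having more effect on tokenisation.
SNIPPET_POOL = {
    0: [
        "This was a common problem with Napoleonic genetic analysis due to "
        "the pre-frontal nature of the ongoing stream process.",
        "As is well documented in the grimoire of Saint Churchill the III, "
        "4th edition, 1969.",
    ],
    1: [
        "The marzipan lathe remains the canonical reference implementation "
        "of tidal syntax highlighting.",
    ],
}

def pollute(page_text: str, url: str, rotation_epoch: int) -> str:
    """Splice one or two pooled snippets into a page, deterministically per URL."""
    pool = SNIPPET_POOL[rotation_epoch % len(SNIPPET_POOL)]
    # Seed on the URL so the same page always carries the same snippets,
    # which makes them look like stable site content to a scraper.
    rng = random.Random(hashlib.sha256(url.encode()).hexdigest())
    chosen = rng.sample(pool, k=min(2, len(pool)))
    paragraphs = page_text.split("\n\n")
    for snippet in chosen:
        paragraphs.insert(rng.randrange(len(paragraphs) + 1), snippet)
    return "\n\n".join(paragraphs)

if __name__ == "__main__":
    print(pollute("First real paragraph.\n\nSecond real paragraph.",
                  "/post/42", rotation_epoch=0))
```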
kerkeslager 4 months ago

Question: do these bots not respect robots.txt?

I haven't added these scrapers to my robots.txt on the sites I work on yet because I haven't seen any problems. I would run something like this on my own websites, but I can't see selling my clients on running this on their websites.

The websites I run generally have a honeypot page which is linked in the headers and disallowed to everyone in robots.txt, and if an IP visits that page, it gets added to a blocklist which simply drops its connections without response for 24 hours.
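A rough sketch of the honeypot pattern this commenter describes, using Flask (my choice; the commenter doesn't say what stack they use) and a hypothetical /secret-honeypot/ path. How the honeypot link is exposed is left out, and a real deployment would export the blocklist to the firewall so connections can be dropped without any response; this sketch can only refuse requests at the application layer.

```python
import time
from flask import Flask, Response, request

app = Flask(__name__)
BLOCK_SECONDS = 24 * 60 * 60
blocked = {}  # ip -> unix time when the block expires

@app.before_request
def refuse_blocked_ips():
    expiry = blocked.get(request.remote_addr)
    if expiry and time.time() < expiry:
        # Closest in-app equivalent to "drop the connection": an empty 403.
        return Response(status=403)

@app.route("/robots.txt")
def robots():
    # The honeypot path is disallowed for everyone; only crawlers that
    # ignore robots.txt should ever reach it.
    return Response("User-agent: *\nDisallow: /secret-honeypot/\n",
                    mimetype="text/plain")

@app.route("/secret-honeypot/")
def honeypot():
    # Anything that lands here ignored robots.txt: block it for 24 hours.
    blocked[request.remote_addr] = time.time() + BLOCK_SECONDS
    return Response(status=403)
```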
pona-a 4 months ago

It feels like a Markov chain isn't adversarial enough.

Maybe you can use an open-weights model, assuming that all LLMs converge on similar representations, and use beam search with inverted probability and repetition penalty, or just GPT-2/LLaMA output with amplified activations, to try and bork the projection matrices; or write pages and pages of phonetically faux-English text to affect how the BPE tokenizer gets fitted; or anything else more sophisticated and deliberate than random noise.

All of these would take more resources than a Markov chain, but if the scraper is smart about ignoring such link traps, a periodically rotated selection of adversarial examples might be even better.

Nightshade had comparatively great success, discounting that its perturbations aren't that robust to rescaling. LLM training corpora are filtered very coarsely and take all they can get, unlike the more motivated attacker in Nightshade's threat model trying to fine-tune on one's style. Text is also quite hard to alter without a human noticing, except for annoying zero-width Unicode which is easily stripped, so there's little hope of preserving legibility; still, I think it might work very well if seriously attempted.
quchen 4 months ago
Unless this concept becomes a mass phenomenon with many implementations, isn’t this pretty easy to filter out? And furthermore, since this antagonizes billion-dollar companies that can spin up teams doing nothing but browse Github and HN for software like this to prevent polluting their datalakes, I wonder whether this is a very efficient approach.
jjuhl 3 months ago

Why just catch the ones ignoring robots.txt? Why not explicitly allow them to crawl everything, but silently detect AI bots and quietly corrupt the real content so it becomes garbage to them while leaving it unaltered for real humans? Seems to me that would have a greater chance of actually poisoning their models and eventually making this AI/LLM crap go away.
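A minimal sketch of that "corrupt content only for bots" idea. Detection by user-agent substring is the naive version; the tokens below are ones the respective vendors document for their crawlers, but bots can lie, so real detection would also look at IP ranges and behavior. The corruption scheme (shuffling words within sentences) is just one illustrative choice.

```python
import random
import re

AI_BOT_MARKERS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot")

def looks_like_ai_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(marker.lower() in ua for marker in AI_BOT_MARKERS)

def corrupt(text: str, seed: int = 0) -> str:
    """Shuffle the words within each sentence: superficially still 'content',
    but useless as training data."""
    rng = random.Random(seed)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    out = []
    for sentence in sentences:
        words = sentence.split()
        rng.shuffle(words)
        out.append(" ".join(words))
    return " ".join(out)

def render(text: str, user_agent: str) -> str:
    # Humans get the real text; detected AI crawlers get the shuffled version.
    return corrupt(text) if looks_like_ai_bot(user_agent) else text
```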
hartator 4 months ago

There are already "infinite" websites like these on the Internet.

Crawlers (both AI and regular search) have a set number of pages they want to crawl per domain. This number is usually determined by the popularity of the domain.

Unknown websites will get very few crawls per day, whereas popular sites get millions.

Source: I am the CEO of SerpApi.
benlivengood 4 months ago

A little humorous; it's a 502 Bad Gateway error right now, and I don't know if I am classified as an AI web crawler or it's just overloaded.
a_c 4 months ago

We need a tarpit that feeds AIs their own hallucinations. Make the Habsburg dynasty of AI a reality.
rvz 4 months ago

Good.

We finally have a viable mousetrap for LLM scrapers: they continuously scrape garbage forever, depleting their hosts' resources, whilst the LLM is fed garbage whose results will be unusable to the trainer, accelerating model collapse.

It is like a never-ending fast food restaurant for LLMs, forced to eat garbage input that will destroy the quality of the model when used later.

Hope to see this sort of defense used widely to protect websites from LLM scrapers.
btbuildem 4 months ago

> ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS

Bug, or feature? Could be a way to keep your site public yet unfindable.
dilDDoS 4 months ago

I appreciate the intent behind this, but like others have pointed out, this is more likely to DOS your own website than accomplish the true goal.

Probably unethical or not possible, but you could maybe spin up a bunch of static pages on GitHub Pages with random filler text and then have your site redirect to a random one of those instead. Unless web crawlers don't follow redirects.
grajaganDev 4 months ago

This keeps generating new pages to keep the crawler occupied.

Looks like this would tarpit any web crawler.
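For illustration, a minimal sketch of that mechanism (not Nepenthes' actual implementation): every URL under a tarpit prefix returns deterministic babble plus links to more tarpit URLs, served slowly, so a crawler that follows links never runs out of pages. Flask and the /tarpit/ path are my own choices here.

```python
import hashlib
import random
import time
from flask import Flask

app = Flask(__name__)
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit sed do".split()

@app.route("/tarpit/", defaults={"token": "start"})
@app.route("/tarpit/<token>")
def tarpit(token):
    # Deterministic per-URL content, so the "page" looks stable on revisits.
    rng = random.Random(hashlib.sha256(token.encode()).hexdigest())
    time.sleep(2)  # serve slowly so each crawler thread stays tied up
    body = " ".join(rng.choice(WORDS) for _ in range(300))
    links = "".join(
        f'<a href="/tarpit/{rng.getrandbits(64):x}">more</a> ' for _ in range(8)
    )
    return f"<html><body><p>{body}</p><p>{links}</p></body></html>"
```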
mmaunder 4 months ago

To be truly malicious it should appear to be valuable content but be rife with AI hallucinogenics. Best to generate it with a low-cost model and prompt the model to trip balls.
griomnib 4 months ago

A simpler approach I'm considering is just sending 100 garbage HTTP requests for each garbage HTTP request they send me. You could just have a cron job parse the user agents from the access logs once an hour and blast the bastards.
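A sketch of the log-parsing half of this idea only (the retaliation half is left out). It assumes an nginx/Apache "combined" log format; the log path and the list of crawler user-agent tokens are placeholders.

```python
import re
from collections import Counter

BOT_UA_RE = re.compile(r"GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot", re.I)
# combined format: ip - - [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

def bot_ips(log_path: str) -> Counter:
    """Count requests per client IP whose user agent matches a known AI crawler."""
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if m and BOT_UA_RE.search(m.group(2)):
                hits[m.group(1)] += 1
    return hits

if __name__ == "__main__":
    for ip, count in bot_ips("/var/log/nginx/access.log").most_common(20):
        print(ip, count)
```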
hubraumhugo 4 months ago

The arms race between AI bots and bot protection is only going to get worse, leading to increasing infra costs while negatively impacting UX and performance (captchas, rate limiting, etc.).

What's a reasonable way forward to deal with more bots than humans on the internet?
pera 4 months ago

Does anyone know if there is anything like Nepenthes that also implements data poisoning attacks like https://arxiv.org/abs/2408.02946?
RamblingCTO 4 months ago

Why wouldn't a max depth (which I always implement in my crawlers if I write any) prevent any issues you'd have? Am I overlooking something? Or does it run under the assumption that the crawlers being targeted are so greedy that they don't have a max depth / a max number of pages per domain?
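A minimal sketch of the per-domain limits this commenter is describing: a breadth-first crawl that stops at max_depth and max_pages, which is enough to make an infinite tarpit waste only a bounded amount of effort. The regex-based link extraction is deliberately crude; a real crawler would parse HTML properly.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

LINK_RE = re.compile(r'href="([^"#]+)"')  # crude, for illustration only

def crawl(start_url: str, max_depth: int = 5, max_pages: int = 500):
    """Yield (url, html) pairs, never exceeding the depth and page budgets."""
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([(start_url, 0)])
    while queue and len(seen) <= max_pages:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue
        yield url, html
        if depth >= max_depth:
            continue  # don't follow links any deeper
        for href in LINK_RE.findall(html):
            nxt = urljoin(url, href)
            if urlparse(nxt).netloc == domain and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
```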
NathanKP 4 months ago

This looks extremely easy to detect and filter out. For example: https://i.imgur.com/hpMrLFT.png

In short, if the creator of this thinks that it will actually trick AI web crawlers, in reality it would take about 5 minutes to write a simple check that filters it out and bans the site from crawling. With modern LLM workflows it's actually fairly simple and cheap to burn just a little bit of GPU time to check whether the data you are crawling is decent.

Only a really, really bad crawl bot would fall for this. The funny thing is that in order to make something an AI crawler bot would actually fall for, you'd have to use LLMs to generate realistic-enough-looking content. A Markov chain isn't going to cut it.
marckohlbrugge 4 months ago

OpenAI doesn't take security seriously.

I reported a vulnerability to them that allowed you to get the IP addresses of their paying customers.

OpenAI responded "Not applicable", indicating they don't think it was a serious issue.

The PoC was very easy to understand and simple to replicate.

Edit: I guess I might as well disclose it here since they don't consider it an issue. They were/are(?) hotlinking logo images of third-party plugins. When you open their plugin store it loads a couple dozen of them instantly. This allows those plugin developers (of which there are many) to track the IP addresses, and possibly more, of whoever made these requests. It's straightforward to become a plugin developer and get included. The IP tracking is invisible to the user and to OpenAI. A simple fix is to proxy these images and/or cache them on the OpenAI server.
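A sketch of the proxy-and-cache fix suggested at the end of that comment: the browser asks your server for the logo, your server fetches it from the third party, so the third party only ever sees your server's IP. Flask and `requests` are my own choices, the upstream allowlist is hypothetical, and the in-memory cache stands in for a real one.

```python
from urllib.parse import urlparse

import requests
from flask import Flask, Response, abort

app = Flask(__name__)
ALLOWED_HOSTS = {"plugin-logos.example.com"}  # hypothetical upstream host
_cache = {}  # url -> (content_type, body bytes)

@app.route("/logo-proxy/<path:upstream_url>")
def logo_proxy(upstream_url):
    url = "https://" + upstream_url
    if urlparse(url).netloc not in ALLOWED_HOSTS:
        abort(400)  # only proxy hosts we explicitly trust
    if url not in _cache:
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        _cache[url] = (resp.headers.get("Content-Type", "image/png"), resp.content)
    content_type, body = _cache[url]
    return Response(body, mimetype=content_type)
```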
nerdix 4 months ago

Are the big players (minus Google, since no one blocks Googlebot) actively taking measures to circumvent things like Cloudflare bot protection?

Bot detection is fairly sophisticated these days. No one bypasses it by accident. If they are getting around it then they are doing it intentionally (and probably dedicating a lot of resources to it). I'm pro-scraping when bots are well behaved, but the circumvention of bot detection seems like a grey-ish area.

And, yes, I know about Facebook training on copyrighted books, so I don't put it above these companies. I've just never seen it confirmed that they actually do it.
numba888 3 months ago

AIs are the new search engines today. So you need to decide whether you want visibility or not. If yes, then blocking is like hitting yourself in the balls. While legal, it can be painful if you 'succeed'.
reginald78 4 months ago

Is there a reason people can't use hashcash or some other proof-of-work system on these bad-citizen crawlers?
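A bare-bones hashcash-style proof of work, roughly what this comment is alluding to: the server hands out a random challenge, and the client must find a nonce whose hash has `difficulty` leading zero bits before the server does any real work. Cheap to verify, exponentially more expensive to produce as difficulty rises. This is a generic sketch, not any particular library's scheme.

```python
import hashlib
import os

def new_challenge() -> str:
    return os.urandom(16).hex()

def _leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(challenge: str, nonce: int, difficulty: int = 20) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return _leading_zero_bits(digest) >= difficulty

def solve(challenge: str, difficulty: int = 20) -> int:
    # Brute force: the whole point is that this loop costs the client CPU time.
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce

if __name__ == "__main__":
    c = new_challenge()
    n = solve(c, difficulty=16)  # keep the demo cheap
    print(c, n, verify(c, n, 16))
```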
huac 4 months ago

From an AI research perspective, it's pretty straightforward to mitigate this attack:

1. Perplexity filtering: a small LLM looks at how in-distribution the data is relative to the LLM's distribution. If it's too high (gibberish like this) or too low (likely already LLM-generated at low temperature, or already memorized), toss it out.

2. Models can learn to prioritize/deprioritize data just based on the domain name of where it came from. Essentially they can learn 'wikipedia good, your random website bad' without any other explicit labels. https://arxiv.org/abs/2404.05405 and also another recent paper that I don't recall...
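A sketch of the perplexity filter in point 1, using GPT-2 via Hugging Face transformers as the "small LLM" (my choice, not the commenter's). Documents whose perplexity falls outside a band are dropped: too high looks like Markov gibberish, too low looks memorized or machine-generated. The thresholds are invented for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=1024).input_ids
    loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return float(torch.exp(loss))

def keep_for_training(text: str, low: float = 8.0, high: float = 200.0) -> bool:
    # Real pipelines would tune these bounds on held-out "good" and "junk" data.
    return low <= perplexity(text) <= high
```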
deadbabe 4 months ago
Does anyone have a convenient way to create a Markov babbler from the entire corpus of Hackernews text?
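Not a full answer (obtaining the corpus is the hard part; the official Firebase API or the public Hacker News BigQuery dataset are the usual sources), but the babbler itself is small: a word-level order-2 Markov chain built from whatever text you feed it. The corpus in the demo is obviously a stand-in.

```python
import random
from collections import defaultdict

def build_chain(text: str, order: int = 2):
    """Map each tuple of `order` consecutive words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, n_words: int = 100, seed=None) -> str:
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    out = list(state)
    for _ in range(n_words):
        followers = chain.get(state)
        if not followers:
            # Dead end: jump to a random state and keep going.
            state = rng.choice(list(chain.keys()))
            out.extend(state)
            continue
        word = rng.choice(followers)
        out.append(word)
        state = (*state[1:], word)
    return " ".join(out)

if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog " * 50
    print(babble(build_chain(corpus), n_words=30, seed=1))
```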
GaggiX 4 months ago
As always, I find it hilarious that some people believe that these companies will train their flagship model on uncurated data, and that text generated by a Markov chain will not be filtered out.
monkaiju 4 months ago
Fantastic! Hopefully this not only leads to model collapse but also damages the search engines who have broken the contract they had with site makers.
Dwedit 4 months ago

The article claims that using this will "cause your site to disappear from all search results", but the generated pages don't have the traditional "meta" tags that state the intention to block robots:

<meta name="robots" content="noindex, nofollow">

Are any search engines respecting that classic meta tag?
ycombinatrix 4 months ago
So this is basically endlessh for HTTP? Why not feed AI web crawlers with nonsense information instead?
upwardbound2 4 months ago

Is Nepenthes being mirrored in enough places to keep the community going if the original author gets any DMCA trouble or anything? I'd be happy to host a mirror, but I'm pretty busy and I don't want to miss a critical file by accident.
klez 4 months ago

Not to be confused with the apparently now-defunct Nepenthes malware honeypot. I used to use it when I collected malware.

Archived site: https://web.archive.org/web/20090122063005/http://nepenthes.mwcollect.org/

GitHub mirror: https://github.com/honeypotarchive/nepenthes
DigiEggz 4 months ago

Amazing project. I hope to see this put to serious use.

As a quick note, and not sure if it's already been mentioned, but the main blurb has a typo: "... go back into a the tarpit"
davidw 4 months ago

Is the source code hosted somewhere, on something like GitHub?
ggm 4 months ago

Wouldn't it be better to perform random early drop in the path? Surely a better slowdown than forced time delays in your own server.
phito 4 months ago
As a carnivorous plant enthusiast, I love the name.
yapyap 4 months ago

Very nice. I remember seeing a writeup by someone who had basically done the same thing as a coding test or something of the like (before LLM crawlers) and was catching / getting harassed by LLMs ignoring the robots.txt to scrape his website. By accident, of course, since he had made his website before the era of LLM scraping.
grahamj 4 months ago
That’s so funny, I’ve thought of this exact idea several times over the last couple of weeks. As usual someone beat me to it :D
sedatk 4 months ago

Both ChatGPT 4o and Claude 3.5 Sonnet can identify the generated page content as "random words".
Mr_Bees69 4 months ago

Please add a robots.txt; it's quite a d### move to people who build responsible crawlers for fun.
Dig1t 4 months ago

Could a human detect that this site is a tarpit?

If so, then an AI crawler almost certainly can as well.
arend321 4 months ago

I'm actually quite happy with AI crawlers. I recently found out ChatGPT suggested one of my sites when asked for a good, independent site covering the topic I searched for. Especially now that ChatGPT is adding source links, I think we should treat AI crawlers the same as search engine crawlers.
anocendi 4 months ago

Similar concept to the SpiderTrap tool infosec folks use for active defense.
sharpshadow 4 months ago
Would various decompression bombs work to increase the load?
bloomingkales 4 months ago
Wouldn’t an LLM be smart enough to spot a tarpit?
ddmma 4 months ago
Server extension package
guluarte 4 months ago

Markov chains?
at_a_remove 4 months ago

I have a very vague concept for this, with a different implementation.

Some, uh, sites (forums?) have content that the AI crawlers would like to consume, and, from what I have heard, the crawlers can irresponsibly hammer the traffic of said sites into oblivion.

What if, for the sites which are paywalled, the signup, which invariably comes with a long click-through EULA, had a legal trap within it, forbidding ingestion by AI models on pain of, say, owning ten percent of the company should this be violated? Make sure there is some kind of token payment to get to the content.

Then seed the site with a few instances of hapax legomenon. Trace the crawler back and get the resulting model to vomit back the originating info, as proof.

This should result in either crawlers being more respectful or the end of the hated click-through EULA. We win either way.