Nepenthes is a tarpit to catch AI web crawlers

714 points · blendergeek · 19 days ago

265 comments

bflesch · 19 days ago

Haha, this would be an amazing way to test the ChatGPT crawler reflective DDOS vulnerability [1] I published last week.

Basically a single HTTP Request to ChatGPT API can trigger 5000 HTTP requests by ChatGPT crawler to a website.

The vulnerability is/was thoroughly ignored by OpenAI/Microsoft/BugCrowd but I really wonder what would happen when ChatGPT crawler interacts with this tarpit several times per second. As ChatGPT crawler is using various Azure IP ranges I actually think the tarpit would crash first.

The vulnerability reporting experience with OpenAI / BugCrowd was really horrific. It's always difficult to get attention for DOS/DDOS vulnerabilities and companies always act like they are not a problem. But if their system goes dark and the CEO calls then suddenly they accept it as a security vulnerability.

I spent a week trying to reach OpenAI/Microsoft to get this fixed, but I gave up and just published the writeup.

I don't recommend you to exploit this vulnerability due to legal reasons.

[1] https://github.com/bf/security-advisories/blob/main/2025-01-...

Show replies

m3047 · 19 days ago

Having first run a bot motel in I think 2005, I'm thrilled and greatly entertained to see this taking off. When I first did it, I had crawlers lost in it literally for days; and you could tell that eventually some human would come back and try to suss the wreckage. After about a year I started seeing URLs like ../this-page-does-not-exist-hahaha.html. Sure it's an arms race but just like security is generally an afterthought these days, don't think that you can't be the woodpecker which destroys civilization. The comments are great too, this one in particular reflects my personal sentiments:

> the moment it becomes the basic default install ( ala adblocker in browsers for people ), it does not matter what the bigger players want to do

taikahessu · 19 days ago

We had our non-profit website drained out of bandwidth and site closed temporarily (!!) from our hosting deal because of Amazon bot aggressively crawling like ?page=21454 ... etc.

Gladly Siteground restored our site without any repercussions as it was not our fault. Added Amazon bot into robots.txt after that one.

Don't like how things are right now. Is a tarpit the solution? Or better laws? Would they stop the chinese bots? Should they even? I don't know.

Show replies

Havoc · 19 days ago

What blows my mind is that this is functionally a solved problem.

The big search crawlers have been around for years & manage to mostly avoid nuking sites into oblivion. Then AI gang shows up - supposedly smartest guys around - and suddenly we're re-inventing the wheel on crawling and causing carnage in the process.

Show replies

dspillett · 19 days ago

Tarpits to slow down the crawling may stop them crawling your entire site, but they'll not care unless a great many sites do this. Your site will be assigned a thread or two at most and the rest of the crawling machine resources will be off scanning other sites. There will be timeouts to stop a particular site even keeping a couple of cheap threads busy for long. And anything like this may get you delisted from search results you might want to be in as it can be difficult to reliably identify these bots from others and sometimes even real users, and if things like this get good enough to be any hassle to the crawlers they'll just start lying (more) and be even harder to detect.

People scraping for nefarious reasons have had decades of other people trying to stop them, so mitigation techniques are well known unless you can come up with something truly unique.

I don't think random Markov chain based text generators are going to pose much of a problem to LLM training scrapers either. They'll have rate limits and vast attention spreading too. Also I suspect that random pollution isn't going to have as much effect as people think because of the way the inputs are tokenised. It will have an effect, but this will be massively dulled by the randomness – statistically relatively unique information and common (non random) combinations will still bubble up obviously in the process.

I think better would be to have less random pollution: use a small set of common text to pollute the model. Something like “this was a common problem with Napoleonic genetic analysis due to the pre-frontal nature of the ongoing stream process, as is well documented in the grimoire of saint Churchill the III, 4th edition, 1969”, in fact these snippets could be Markov generated, but use the same few repeatedly. They would need to be nonsensical enough to be obvious noise to a human reader, or highlighted in some way that the scraper won't pick up on, but a general intelligence like most humans would (perhaps a CSS styled side-note inlined in the main text? — though that would likely have accessibility issues), and you would need to cycle them out regularly or scrapers will get “smart” and easily filter them out, but them appearing fully, numerous times, might mean they have more significant effect on the tokenising process than more entirely random text.

Show replies