OpenAI crawler burning money for nothing

12 points · babuskov · 1 day ago

I have a bunch of blog posts, with URLs like these:

  https://mywebsite/1-post-title
  https://mywebsite/2-post-title-second
  https://mywebsite/3-post-title-third
  https://mywebsite/4-etc
For some reason, the crawler tries every combination of post numbers, so the requests look like this:

  https://mywebsite/1-post-title/2-post-title-second
  https://mywebsite/1-post-title/3-post-title-third
etc.

Since the blog engine simply discards everything after the leading number (1, 2, 3, ...) and serves the content for blog post #1, #2, #3, ..., the web server returns a valid page for every such URL. However, all of those pages are identical.
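Roughly, the routing behaves like this (a minimal Python sketch of the behavior described above, not the actual engine code):

  # Hypothetical handler: only the leading number is parsed, so any
  # trailing path segments are silently ignored.
  import re

  POSTS = {1: "post-title", 2: "post-title-second", 3: "post-title-third"}

  def resolve(path):
      match = re.match(r"/(\d+)", path)    # grab the leading post id
      if match and int(match.group(1)) in POSTS:
          return "content of post #" + match.group(1)
      return None                          # anything else would be a 404

  # Both calls return the exact same page, which is why every
  # /1-.../2-... combination looks like a valid URL to the crawler.
  print(resolve("/1-post-title"))
  print(resolve("/1-post-title/2-post-title-second"))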

The main problem here is that no page on the website actually contains compound links like https://mywebsite/1-post-title/2-post-title-second

So it's clearly some bug in the crawler.

Maybe OpenAI is using AI-written code for their crawler, because it has bugs so dumb you cannot believe any human would write them.

They will make 90,000 requests (300 × 300 URL combinations) to crawl my small blog of 300 posts.

I cannot imagine what happens to larger websites with thousands of blog posts.


10 comments
markus_zhang · 1 day ago
I wonder if one can build maze webpages to trap these AI crawlers. Humans wouldn't be bothered, but once a visitor is identified as a crawler, the site dynamically generates page after page of garbage. The server doesn't need to store any of that garbage, but the crawler has to.
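Something like this (a rough Python sketch, purely illustrative): each page is derived deterministically from its URL, so the server stores nothing while the crawler keeps discovering "new" links.

  import hashlib

  def maze_page(path, width=5):
      # Derive the page content from the URL itself, so nothing is stored
      # server-side; every generated link leads to another generated page.
      seed = hashlib.sha256(path.encode()).hexdigest()
      links = "".join(
          '<a href="%s/%s">more</a>\n' % (path.rstrip("/"), seed[i*8:(i+1)*8])
          for i in range(width)
      )
      return "<html><body><p>%s</p>\n%s</body></html>" % (seed, links)

  print(maze_page("/maze/start"))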
codemusings · 22 hours ago
For what it's worth: they do honor the robots.txt file. I had the same problem with a client's CMS and denying all AI crawler user agents did the trick.
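A minimal version looks something like this (GPTBot is OpenAI's crawler user agent; other vendors publish their own, so add them the same way):

  User-agent: GPTBot
  Disallow: /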

It's clear they've all gone mad. The traffic spiked 400% overnight and made the CMS unresponsive a few times a day.

gbertb · 1 day ago
How are the links structured in the href attribute? Are they relative or absolute? If relative, that's probably why.
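You can see how a relative href would produce exactly those compound URLs (Python sketch; the trailing slash on the page URL is the assumption that makes it happen):

  from urllib.parse import urljoin

  # If the page URL ends with a slash, a relative href is resolved *under*
  # that page's path, producing the compound URLs from the post.
  print(urljoin("https://mywebsite/1-post-title/", "2-post-title-second"))
  # -> https://mywebsite/1-post-title/2-post-title-second

  # Without the trailing slash, the last segment is replaced instead.
  print(urljoin("https://mywebsite/1-post-title", "2-post-title-second"))
  # -> https://mywebsite/2-post-title-second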


101008 · 1 day ago
Cloudflare should provide a service (paid or free) to block AI crawlers.
