OpenAI's bot crushed this seven-person company's web site 'like a DDoS attack'

techcrunch.com

117 points · vednig · 6 days ago


104 comments
dang · 6 days ago
ericholscher · 6 days ago


joelkoen · 6 days ago
> “OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site.

The IP addresses in the screenshot are all owned by Cloudflare, meaning that their server logs are only recording the IPs of Cloudflare's reverse proxy, not the real client IPs.
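For anyone wondering how you'd log the real client instead: Cloudflare passes the visitor's address along in the `CF-Connecting-IP` request header. A rough sketch of picking it up, where `client_ip` and its arguments are hypothetical names and not anything from the site in the article:

```python
from typing import Mapping


def client_ip(headers: Mapping[str, str], peer_ip: str) -> str:
    """Best-effort visitor IP for a server sitting behind a reverse proxy.

    Assumption: the caller has already checked that peer_ip belongs to the
    proxy. These headers are trivially spoofable on direct connections.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Cloudflare sets CF-Connecting-IP to the address of the original visitor.
    cf_ip = lowered.get("cf-connecting-ip", "").strip()
    if cf_ip:
        return cf_ip

    # Generic fallback: the left-most X-Forwarded-For entry is the client.
    first_hop = lowered.get("x-forwarded-for", "").split(",")[0].strip()

    # With no proxy headers at all, the TCP peer is the best we have.
    return first_hop or peer_ip
```

Logging that value instead of the socket peer is what would let you count distinct crawler IPs at all.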

Also, the logs don't show any timestamps and there doesn't seem to be any mention of the request rate in the whole article.

I'm not trying to defend OpenAI, but as someone who scrapes data I think it's unfair to throw around terms like "DDoS attack" without providing basic request-rate metrics. This seems to be based purely on the use of multiple IPs, which was actually caused by their own server configuration and has nothing to do with OpenAI.
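To be concrete about what "basic request rate metrics" could look like: given the timestamps the screenshot leaves out, the busiest minute is a few lines of code. This is only a sketch that assumes Common Log Format timestamps; the function names and the sample values in the usage note are made up, not taken from the article's logs.

```python
from collections import Counter
from datetime import datetime


def parse_ts(text):
    """Parse a Common Log Format timestamp such as '10/Jan/2025:13:55:36 +0000'."""
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")


def peak_requests_per_minute(timestamps):
    """Return (minute, count) for the busiest minute, or None if there is no data."""
    # Truncate every timestamp to its minute and count requests per bucket.
    buckets = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    if not buckets:
        return None
    return max(buckets.items(), key=lambda item: item[1])
```

Usage would be something like `peak_requests_per_minute(parse_ts(t) for t in stamps)`, where `stamps` is the timestamp column pulled out of the access log. A peak number per minute says far more about "like a DDoS" than a count of IPs does.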


griomnib · 6 days ago
I’ve been a web developer for decades as well as doing scraping, indexing, and analyzing millions of sites.

Just follow the golden rule: don’t ever load any site more aggressively than you would want yours to be.

This isn’t hard stuff, and these AI companies have grossly inefficient and obnoxious scrapers.

As a site owner this pisses me off as a matter of decency on the web, but as an engineer doing distributed data collection I’m offended by how shitty and inefficient their crawlers are.
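For illustration, the golden rule mostly comes down to a per-host minimum delay between requests. A minimal sketch, assuming a single-threaded crawler; `PoliteThrottle` is a hypothetical name, and the right interval depends on the site:

```python
import time


class PoliteThrottle:
    """Enforce a minimum gap between consecutive requests to the same host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits on one host
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last_hit = {}               # host -> time of the previous request

    def wait(self, host):
        """Block until it is polite to hit `host`; return the seconds slept."""
        delay = 0.0
        last = self._last_hit.get(host)
        if last is not None:
            elapsed = self._clock() - last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the time after any sleep, i.e. when the request actually goes out.
        self._last_hit[host] = self._clock()
        return delay
```

Call `throttle.wait(host)` before every fetch. Different hosts never block each other, so a crawl across many sites stays fast without hammering any single one.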


jonas21 · 6 days ago
It's "robots.txt", not "robot.txt". I'm not just nitpicking -- it's a clear signal the journalist has no idea what they're talking about.

That and the fact that they're using a log file with the timestamps omitted as evidence of "how ruthlessly an OpenAI bot was accessing the site" makes the claims in the article a bit suspect.

OpenAI isn't necessarily in the clear here, but this is a low-quality article that doesn't provide much signal either way.
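For reference, here is roughly what a robots.txt that turns away one crawler looks like, checked with Python's standard `urllib.robotparser`. The rules and the `shop.example` URL are made up for illustration, and `GPTBot` is the user-agent token OpenAI documents for its crawler; nothing here is taken from the site in the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed(ROBOTS_TXT, "GPTBot", "https://shop.example/products/1")` should come back `False`, while an unlisted bot is allowed. Keep in mind that robots.txt is purely advisory: it only helps against crawlers that choose to honor it.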
