joelkoen ·6 days ago
> “OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site.
The IP addresses in the screenshot are all owned by Cloudflare, meaning that the site's server logs are only recording the IPs of Cloudflare's reverse proxy, not the real client IPs.
Also, the logs don't show any timestamps, and there doesn't seem to be any mention of the request rate anywhere in the article.
I'm not trying to defend OpenAI, but as someone who scrapes data, I think it's unfair to throw around terms like "DDoS attack" without providing basic request rate metrics. The claim seems to be based purely on the use of multiple IPs, which was actually caused by the site's own server configuration and has nothing to do with OpenAI.
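To make "basic request rate metrics" concrete, here is a rough sketch of the analysis that would be needed. It assumes the origin logs timestamps and request headers, and that Cloudflare's `CF-Connecting-IP` header is present with exactly that casing. The record layout, the function names, and the two hardcoded ranges (a small sample of Cloudflare's published list) are assumptions for illustration, not anything taken from the article.

```python
from collections import Counter
from datetime import datetime
from ipaddress import ip_address, ip_network

# Assumption: a small sample of Cloudflare's published IPv4 ranges; a real
# setup would load the current full list instead of hardcoding it.
TRUSTED_PROXIES = [ip_network("173.245.48.0/20"), ip_network("104.16.0.0/13")]


def real_client_ip(peer_ip, headers):
    """Return the visitor IP, trusting CF-Connecting-IP only from known proxies."""
    if any(ip_address(peer_ip) in net for net in TRUSTED_PROXIES):
        return headers.get("CF-Connecting-IP", peer_ip)
    # Direct connections could spoof the header, so fall back to the peer IP.
    return peer_ip


def peak_requests_per_minute(records):
    """records: iterable of (iso_timestamp, peer_ip, headers).

    Returns {client_ip: highest request count seen in any single minute}.
    """
    buckets = Counter()
    for ts, peer_ip, headers in records:
        minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
        buckets[(real_client_ip(peer_ip, headers), minute)] += 1
    peaks = {}
    for (ip, _minute), count in buckets.items():
        peaks[ip] = max(peaks.get(ip, 0), count)
    return peaks
```

With something like this, many proxy IPs in the log collapse to however many real clients there were, and the per-minute peak gives an actual number to argue about.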
griomnib ·6 days ago
I’ve been a web developer for decades, and I've also spent years scraping, indexing, and analyzing millions of sites.
Just follow the golden rule: don’t ever load any site more aggressively than you would want yours to be.
This isn’t hard stuff, and these AI companies have grossly inefficient and obnoxious scrapers.
As a site owner this pisses me off as a matter of decency on the web, but as an engineer doing distributed data collection I’m offended by how shitty and inefficient their crawlers are.
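A minimal sketch of what the golden rule can look like in a crawler: a per-host throttle that guarantees a minimum gap between requests to the same site. The class name, the `reserve` method, and the 5-second default are all hypothetical choices for illustration; pick a delay you would be happy to receive yourself.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Space out requests so each host sees at most one per `min_delay` seconds."""

    def __init__(self, min_delay=5.0, clock=time.monotonic):
        self.min_delay = min_delay  # assumed default, not a standard value
        self.clock = clock
        self.next_allowed = {}  # host -> earliest time the next request may start

    def reserve(self, url):
        """Reserve the next slot for this URL's host; return seconds to wait first."""
        host = urlsplit(url).netloc
        now = self.clock()
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start - now
```

Calling `time.sleep(throttle.reserve(url))` before each fetch keeps the load on any single site bounded, no matter how many URLs are queued for it or how many other hosts are being crawled at the same time.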
jonas21 ·6 days ago
It's "robots.txt", not "robot.txt". I'm not just nitpicking -- it's a clear signal the journalist has no idea what they're talking about.
That and the fact that they're using a log file with the timestamps omitted as evidence of "how ruthlessly an OpenAI bot was accessing the site" make the claims in the article a bit suspect.
OpenAI isn't necessarily in the clear here, but this is a low-quality article that doesn't provide much signal either way.
dang ·6 days ago
AI companies cause most of traffic on forums - https://news.ycombinator.com/item?id=42549624 - Dec 2024 (438 comments)
ericholscher ·6 days ago
They really are trying to burn all their goodwill to the ground with this stuff.