TechEcho

8 comments

mmastrac 7 months ago

I found that I was getting random bot attacks on progscrape.com with no identifiable bot signature (i.e., the signature matched a valid Chrome desktop client), but at a rate that was only possible for a bot. I ended up having to add token buckets keyed by IP/User-Agent to help avoid this deluge of traffic.

Agents that trigger the first level of rate limiting go through a "tarpit" that holds their connection for a bit before serving it, which seems to keep most of the bad actors in check. It's impossible to block them via robots.txt, and I'm trying to avoid using too big a hammer in my Cloudflare settings.

EDIT: checking the logs, it seems that the only bot getting tarpitted right now is OpenAI, and they _do_ have a GPTBot signature:

<pre><code>2024-10-31T02:30:23.312139Z WARN progscrape::web: User hit soft rate limit: ratelimit=soft ip="20.171.206.77" browser=Some("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)") method=GET uri=/?search=science.org</code></pre>

jhpacker 7 months ago

Cloudflare Radar, which is presumably a much bigger and better sample, reports Bytespider as the #5 AI crawler behind FB, Amazon, GPTBot, and Google: https://radar.cloudflare.com/explorer?dataSet=ai.bots And that's not including most of the highest-volume spiders overall, like Googlebot, Bingbot, Yandex, Ahrefs, etc.

Not to say it isn't an issue, but that Fortune article they reference is pretty alarmist and thin on detail.

neilv 7 months ago

Given the high-profile national security scrutiny that ByteDance was already in over TikTok, and now with the AI training competitiveness on national authorities' minds, maybe this behavior by ByteDance is on the radar of someone who's thinking of whether CFAA or other regulation applies.

As someone who's built multiple (respectful) Web crawlers, for academic research and for respectable commerce, I'm wondering whether abusers are going to make it harder for legitimate crawlers to operate.

wtf242 7 months ago

I had the same issue with TikTok/ByteDance. They were using almost 100 GB of my traffic per month.

I now block all AI crawlers at the Cloudflare WAF level. On Monday I noticed a HUGE spike in traffic and my site was not handling it well. After a lot of troubleshooting and log parsing, I found I was getting millions of requests from China that were getting past Cloudflare's bot protection.

I ended up having to force a CF managed challenge for the entire country of China to get my site back to a normal working state.

In the past 24 hours CF has blocked 1.66M bot requests. Good luck running a site without using Cloudflare or something similar.

AI crawlers are just out of control.

PittleyDunkin 7 months ago

How do you differentiate between "ai" (whatever that means) and other crawlers?
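One common, if imperfect, answer is to match self-declared tokens in the User-Agent header. The sketch below assumes a hand-maintained token list: `GPTBot` and `Bytespider` are the AI crawlers named in this thread, and the search/SEO tokens are the ones jhpacker lists. The function name and the category labels are hypothetical, and this only works for crawlers that identify themselves honestly, which, as noted elsewhere in the thread, many do not.

```python
# Illustrative token lists only; a real deployment would need a maintained list.
AI_CRAWLER_TOKENS = ("gptbot", "bytespider")
OTHER_CRAWLER_TOKENS = ("googlebot", "bingbot", "yandex", "ahrefs")


def classify_user_agent(user_agent):
    """Return 'ai_crawler', 'other_crawler', or 'unknown' for a User-Agent string."""
    if not user_agent:
        return "unknown"
    lowered = user_agent.lower()
    # Case-insensitive substring match against each self-declared crawler token.
    if any(token in lowered for token in AI_CRAWLER_TOKENS):
        return "ai_crawler"
    if any(token in lowered for token in OTHER_CRAWLER_TOKENS):
        return "other_crawler"
    # Ordinary browsers and crawlers spoofing a browser both land here.
    return "unknown"
```

Applied to the User-Agent in mmastrac's log line, this returns `"ai_crawler"`. A spoofed Chrome User-Agent returns `"unknown"`, which is why rate-based detection is still needed alongside it.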

odc 7 months ago

Good to know there are other solutions than Cloudflare to block those leeches.

sghiassy 7 months ago

It’s 90% of 1%… title is misleading

yazzku 7 months ago

tl;dr the crawlers do not respect robots.txt or the user agent anymore, but you can drop big bucks on the enterprise HA offering to stop them through other means.

Nearly 90% of our AI crawler traffic is from ByteDance
