Nearly 90% of our AI crawler traffic is from ByteDance

95 points by jcat123 7 months ago

8 comments

mmastrac 7 months ago
I found that I was getting random bot attacks on progscrape.com with no identifiable bot signature (i.e., the User-Agent matched a valid desktop Chrome client), but at a rate that was only possible for a bot. I ended up having to add token buckets keyed by IP/User-Agent to help avoid this deluge of traffic.

Agents that trigger the first level of rate limiting go through a "tarpit" that holds their connection for a bit before serving it, which seems to keep most of the bad actors in check. It's impossible to block them via robots.txt, and I'm trying to avoid using too big of a hammer in my Cloudflare settings.

EDIT: checking the logs, it seems that the only bot getting tarpitted right now is OpenAI, and they _do_ have a GPTBot signature:

```
2024-10-31T02:30:23.312139Z WARN progscrape::web: User hit soft rate limit: ratelimit=soft ip="20.171.206.77" browser=Some("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)") method=GET uri=/?search=science.org
```
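For readers who want to see the shape of this approach, here is a minimal sketch of a two-tier token bucket keyed by (IP, User-Agent), where exhausting the soft bucket leads to a tarpit delay and exhausting the hard bucket leads to rejection. It is written in Python for brevity and is not progscrape's actual code; the class names, capacities, refill rates, and the tarpit duration are all assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Token bucket that refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, capacity, rate, now):
        self.capacity = float(capacity)
        self.rate = float(rate)
        self.tokens = float(capacity)  # start full
        self.updated = now

    def take(self, now):
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Hypothetical two-tier limiter keyed by (ip, user_agent).

    soft/hard are (capacity, refill_rate) pairs; the values are made up.
    """

    def __init__(self, soft=(10, 1.0), hard=(30, 1.0), tarpit_seconds=2.0):
        self.soft = soft
        self.hard = hard
        self.tarpit_seconds = tarpit_seconds
        self.buckets = {}  # (ip, user_agent) -> (soft_bucket, hard_bucket)

    def check(self, ip, user_agent, now=None):
        """Return (decision, delay_seconds) for one incoming request."""
        now = time.monotonic() if now is None else now
        key = (ip, user_agent)
        if key not in self.buckets:
            self.buckets[key] = (
                TokenBucket(*self.soft, now),
                TokenBucket(*self.hard, now),
            )
        soft_bucket, hard_bucket = self.buckets[key]
        if not hard_bucket.take(now):
            return ("reject", 0.0)  # far over the limit: refuse outright
        if not soft_bucket.take(now):
            # Over the soft limit: the caller holds the connection, then serves.
            return ("tarpit", self.tarpit_seconds)
        return ("serve", 0.0)
```

A request handler would call `check(ip, user_agent)` and, on a `"tarpit"` decision, wait for the returned delay before responding. A real deployment would also need to evict idle keys so the bucket map does not grow without bound.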
jhpacker 7 months ago
Cloudflare Radar, which is presumably a much bigger and better sample, reports Bytespider as the #5 AI crawler behind FB, Amazon, GPTBot, and Google: https://radar.cloudflare.com/explorer?dataSet=ai.bots And that's not including most of the highest-volume spiders overall, like Googlebot, Bingbot, Yandex, Ahrefs, etc.

Not to say it isn't an issue, but that Fortune article they reference is pretty alarmist and thin on detail.
neilv 7 months ago
Given the high-profile national security scrutiny that ByteDance was already under over TikTok, and now with AI training competitiveness on national authorities' minds, maybe this behavior by ByteDance is on the radar of someone who's thinking about whether the CFAA or other regulation applies.

As someone who's built multiple (respectful) Web crawlers, for academic research and for respectable commerce, I'm wondering whether abusers are going to make it harder for legitimate crawlers to operate.
wtf242 7 months ago
I had the same issue with TikTok/ByteDance. They were using almost 100 GB of my traffic per month.

I now block all AI crawlers at the Cloudflare WAF level. On Monday I noticed a HUGE spike in traffic and my site was not handling it well. After a lot of troubleshooting and log parsing, I found I was getting millions of requests from China that were getting past Cloudflare's bot protection.

I ended up having to force a CF managed challenge for the entire country of China to get my site back to a normal working state.

In the past 24 hours CF has blocked 1.66M bot requests. Good luck running a site without using Cloudflare or something similar.

AI crawlers are just out of control.
PittleyDunkin 7 months ago
How do you differentiate between "AI" (whatever that means) and other crawlers?
odc 7 months ago
Good to know there are other solutions than Cloudflare to block those leeches.
sghiassy 7 months ago
It’s 90% of 1%… title is misleading
yazzku 7 months ago
tl;dr the crawlers do not respect robots.txt or the user agent anymore, but you can drop big bucks on the enterprise HA offering to stop them through other means.