
ByteDance’s Bytespider is scraping at much higher rates than other platforms

141 points by wmstack 8 months ago

18 comments

neonate 8 months ago
https://archive.md/btyIo
486sx33 8 months ago
It's unfortunate and kind of dystopian. We have an opportunity to properly archive all of the world's online data and catalog it at very, very low cost (historically speaking), so that the future of our planet will have a much better reference point for the past.

Instead of that, companies are sucking up as much crap as possible, tokenizing it, then scrubbing it and adding "safety" to it.

Reality is always much stranger than fiction.
Comment #41765908 not loaded
jgrahamc 8 months ago
Stuff like this is why Cloudflare launched the AI Audit feature and the ability to block "AI bots". We're about to launch a feature that'll enforce your robots.txt.
Comment #41766317 not loaded
Comment #41766311 not loaded
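As context for the robots.txt enforcement idea above, here is a minimal sketch, using Python's standard urllib.robotparser, of checking an incoming request against a site's robots.txt. This is not Cloudflare's implementation; the site URL, user agent, and path are illustrative.

```python
# Minimal sketch of checking requests against robots.txt with the Python
# standard library. Site URL, user agent, and path below are illustrative.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")  # hypothetical site
robots.read()  # fetches and parses the live robots.txt

def allow_request(user_agent: str, path: str) -> bool:
    """Return False (e.g. answer 403) when robots.txt disallows this agent/path."""
    return robots.can_fetch(user_agent, path)

# A rule like "User-agent: Bytespider / Disallow: /" would make this False,
# but only for crawlers that actually send that user agent.
print(allow_request("Bytespider", "/articles/2024/"))
```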
Ironlikebike 8 months ago
In my last job, we observed ByteDance scraping TBs of OS testing data through the RESTful API that our OSS community front end used to serve its CI results to the community. The scraping was so relentless that it was causing performance problems, and we were worried it would run up large network egress fees as well. We locked the API down after that, and anyone who wanted the results had to ask for explicit permission and be granted access.
Havoc 8 months ago
It's going to be hard to enforce anything against this when it happens across jurisdictions like this.

I don't see how copyright survives long term in this sort of context.
Comment #41764978 not loaded
Comment #41764925 not loaded
Comment #41766367 not loaded
Comment #41765515 not loaded
MaKey 8 months ago
Somehow the headline made me think of a parent with a TikTok account.
Comment #41764971 not loaded
Comment #41764731 not loaded
Comment #41764922 not loaded
shellac 8 months ago
I'm pretty sure this bot has been operating for much longer than the article suggests (April this year), and it truly is a pain. I work in academia and see a lot of ill-considered web scraping by ML/AI researchers, but Bytespider is in a league of its own.
benreesman 8 months ago
Indiscriminate scraping is a dick move.

But if you're going to do it, do it properly. I would have hung it off the Like button with an ungodly ZooKeeper ensemble and trained a GBDT on which parts of which URLs I could just obliterate with Proxygen.

We'd have it all in about 4 days. Don't ask me how I know.

The second worst thing about the AI megacorps, after being evil, is being staffed by people who use Cursor.

Edit: on the back of the valued feedback of a valued commenter, I'd like to acknowledge that I made a sloppy mistake and have corrected it in haste, making no excuses. It would be super great if the largest private institutions in the history of the world took the care with, give or take, everything that I do with trolling on a forum.
Comment #41764947 not loaded
Comment #41764866 not loaded
Comment #41766772 not loaded
Comment #41765080 not loaded
jl6 8 months ago
I have observed this bot requesting URLs that haven’t been live for over a decade, and to which no reference can now be found in search engines. I imagine there must be a private trade in URL lists.
Comment #41765041 not loaded
koolba 8 months ago
> The Bytespider bot, much like those of OpenAI and Anthropic, does not respect robots.txt, the research shows. Robots.txt is a line of code that publishers can put into a website that, while not legally binding in any way, is supposed to signal to scraper bots that they cannot take that website's data.

Do any of these scrapers uniquely and unambiguously identify themselves as a bot?

Or are those days long over?
Comment #41764871 not loaded
Comment #41764927 not loaded
Comment #41764801 not loaded
Comment #41764924 not loaded
Comment #41764846 not loaded
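On the identification question: crawlers that do self-identify can be spotted in ordinary access logs by user-agent substring. A rough sketch follows, assuming a combined-format log; the token list is illustrative, not an exhaustive inventory of AI crawlers.

```python
# Rough sketch: tally self-identified crawler user agents in a combined-format
# access log. Only catches bots that declare themselves; the token list is
# illustrative, not an exhaustive inventory of AI crawlers.
import re
from collections import Counter

CRAWLER_TOKENS = ["Bytespider", "GPTBot", "ClaudeBot", "CCBot"]  # assumed UA substrings
UA_RE = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"$')  # last quoted field = user agent

def count_crawlers(log_path: str) -> Counter:
    hits = Counter()
    with open(log_path) as log:
        for line in log:
            match = UA_RE.search(line.rstrip("\n"))
            if not match:
                continue
            for token in CRAWLER_TOKENS:
                if token in match.group("ua"):
                    hits[token] += 1
    return hits

# Usage: print(count_crawlers("/var/log/nginx/access.log"))
```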
bilekas 8 months ago
> does not respect robots.txt, research shows.

It would be nice, then, for the investigators to help people with the identifying markers for such crawlers. All we get is a mention of darkvisitors, which seems to be a paid service to "block agents who try to ignore your robots.txt."

I'm not sure how much that can be trusted, given their business model.
buro9 8 months ago
Also the Facebook scraper, which does not respect robots.txt and is definitely just scraping.

AS blocks are the only really effective tool now; many scrapers do not even respect user-agent-based blocking.
Comment #41765646 not loaded
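A minimal sketch of the AS-blocking idea: collect the prefixes announced by the AS you want to block (from BGP or whois data, not shown here) and reject client IPs that fall inside them. The prefixes below are placeholder documentation ranges, not any particular company's.

```python
# Sketch of AS-level blocking: reject client IPs that fall inside prefixes
# announced by the AS you want to block. The prefix list is a placeholder
# (documentation ranges); in practice it would come from BGP/whois data.
import ipaddress

BLOCKED_PREFIXES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_PREFIXES)

# is_blocked("203.0.113.7") -> True, is_blocked("192.0.2.1") -> False
```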
wtk 8 months ago
https://archive.ph/https://fortune.com/2024/10/03/bytedance-tiktok-bytespider-scraper-bot/
kgen 8 months ago
To be honest, it's probably not enough to just block these scrapers if they're acting maliciously. People should start serving generated content back to them and see how long it takes for them to catch on and fix the problem.
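A toy sketch of that idea, assuming the scraper self-identifies by user agent (which, per the thread, it often will not): a hypothetical Flask handler that returns filler text to matching agents instead of the real page.

```python
# Toy sketch of serving filler text to self-identified crawlers instead of the
# real page. The Flask app, route, and user-agent tokens are all hypothetical.
import random
from flask import Flask, request

app = Flask(__name__)
CRAWLER_TOKENS = ("Bytespider", "GPTBot")  # assumed self-identifying UA substrings
WORDS = ["quantum", "synergy", "blockchain", "paradigm", "latency", "gradient"]

def looks_like_crawler(user_agent: str) -> bool:
    return any(token in user_agent for token in CRAWLER_TOKENS)

@app.route("/articles/<slug>")
def article(slug: str):
    if looks_like_crawler(request.headers.get("User-Agent", "")):
        # Cheap, plausible-looking nonsense; useless as training data.
        return " ".join(random.choices(WORDS, k=200))
    return f"Real article content for {slug}"  # stand-in for the real handler
```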
OuterVale 8 months ago
This comes to mind: https://youtube.com/watch?v=Hi5sd3WEh0c
nubinetwork 8 months ago
> The China-based parent company of video app TikTok released its own web crawler or scraper bot, dubbed Bytespider, sometime in April

Uh, no... Bytespider has been around for a long time...
OutOfHere 8 months ago
Just how would a scraper catch up with the internet if not by accelerating the rate? It is to be expected if the scraping is to succeed.
welder 8 months ago
So what, who cares? Is this newsworthy? It's definitely not something to get upset about; web scraping is a normal part of the internet.