
ByteDance’s Bytespider is scraping at much higher rates than other platforms

141 points by wmstack 8 months ago

18 comments

neonate 8 months ago
https://archive.md/btyIo
486sx33 8 months ago
It's unfortunate and kind of dystopian. We have an opportunity to properly archive all of the world's online data and catalog it at very, very low cost (historically speaking), so that the future of our planet will have a much better reference point for the past.

Instead of that, companies are sucking up as much crap as possible, tokenizing it, then scrubbing it and adding "safety" to it.

Reality is always much stranger than fiction.
Comment #41765908 not loaded
jgrahamc 8 months ago
Stuff like this is why Cloudflare launched the AI Audit feature and the ability to block "AI bots". We're about to launch a feature that'll enforce your robots.txt.
Comment #41766317 not loaded
Comment #41766311 not loaded
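As context for the robots.txt enforcement idea above, here is a minimal sketch, using Python's standard urllib.robotparser, of checking an incoming request against a site's robots.txt. This is not Cloudflare's implementation; the site URL, user agent, and path are illustrative.

```python
# Minimal sketch of checking requests against robots.txt with the Python
# standard library. Site URL, user agent, and path below are illustrative.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")  # hypothetical site
robots.read()  # fetches and parses the live robots.txt

def allow_request(user_agent: str, path: str) -> bool:
    """Return False (e.g. answer 403) when robots.txt disallows this agent/path."""
    return robots.can_fetch(user_agent, path)

# A rule like "User-agent: Bytespider / Disallow: /" would make this False,
# but only for crawlers that actually send that user agent.
print(allow_request("Bytespider", "/articles/2024/"))
```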
Ironlikebike 8 months ago
In my last job, we observed ByteDance scraping TBs of OS testing data through the RESTful API that our OSS community front end used to serve its CI results to the community. The scraping was so relentless that it was causing performance problems, and we were worried it would run up large network egress fees as well. We locked the API down after that, and anyone who wanted the results had to ask for explicit permission and be granted access.
Havoc 8 months ago
It's going to be hard to enforce anything against this when it happens across jurisdictions like this.

I don't see how copyright survives long term in this sort of context.
Comment #41764978 not loaded
Comment #41764925 not loaded
Comment #41766367 not loaded
Comment #41765515 not loaded
MaKey 8 months ago
Somehow the headline made me think of a parent with a TikTok account.
Comment #41764971 not loaded
Comment #41764731 not loaded
Comment #41764922 not loaded
shellac 8 months ago
I'm pretty sure this bot has been operating for much longer than the article suggests (April this year), and it truly is a pain. I work in academia and see a lot of ill-considered web scraping by ML/AI researchers, but Bytespider is in a league of its own.
benreesman 8 months ago
Indiscriminate scraping is a dick move.

But if you're going to do it, do it properly. I would have hung it off the Like button with an ungodly ZooKeeper ensemble and trained a GBDT on which parts of which URLs I could just obliterate with Proxygen.

We'd have it all in about 4 days. Don't ask me how I know.

The second worst thing about the AI megacorps, after being evil, is being staffed by people who use Cursor.

Edit: on the back of the valued feedback of a valued commenter, I'd like to acknowledge that I made a sloppy mistake and have corrected it in haste, making no excuses. It would be super great if the largest private institutions in the history of the world took the care with, give or take, everything that I do with trolling on a forum.
Comment #41764947 not loaded
Comment #41764866 not loaded
Comment #41766772 not loaded
Comment #41765080 not loaded
jl6 8 months ago
I have observed this bot requesting URLs that haven’t been live for over a decade, and to which no reference can now be found in search engines. I imagine there must be a private trade in URL lists.
Comment #41765041 not loaded
koolba 8 months ago
> The Bytespider bot, much like those of OpenAI and Anthropic, does not respect robots.txt, the research shows. Robots.txt is a line of code that publishers can put into a website that, while not legally binding in any way, is supposed to signal to scraper bots that they cannot take that website's data.

Do any of these scrapers uniquely and unambiguously identify themselves as a bot?

Or are those days long over?
Comment #41764871 not loaded
Comment #41764927 not loaded
Comment #41764801 not loaded
Comment #41764924 not loaded
Comment #41764846 not loaded
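On the identification question: crawlers that do self-identify can be spotted in ordinary access logs by user-agent substring. A rough sketch follows, assuming a combined-format log; the token list is illustrative, not an exhaustive inventory of AI crawlers.

```python
# Rough sketch: tally self-identified crawler user agents in a combined-format
# access log. Only catches bots that declare themselves; the token list is
# illustrative, not an exhaustive inventory of AI crawlers.
import re
from collections import Counter

CRAWLER_TOKENS = ["Bytespider", "GPTBot", "ClaudeBot", "CCBot"]  # assumed UA substrings
UA_RE = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"$')  # last quoted field = user agent

def count_crawlers(log_path: str) -> Counter:
    hits = Counter()
    with open(log_path) as log:
        for line in log:
            match = UA_RE.search(line.rstrip("\n"))
            if not match:
                continue
            for token in CRAWLER_TOKENS:
                if token in match.group("ua"):
                    hits[token] += 1
    return hits

# Usage: print(count_crawlers("/var/log/nginx/access.log"))
```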
bilekas 8 months ago
> does not respect robots.txt, research shows.

It would be nice, then, for the investigators to help people with the identifying markers for such crawlers. All we get is a mention of darkvisitors, which seems to be a paid service to "block agents who try to ignore your robots.txt."

I'm not sure how much that can be trusted, given their business model.
buro9 8 months ago
Also the Facebook scraper, which does not respect robots.txt and is definitely just scraping.

AS blocks are the only really effective tool now; many scrapers do not even respect user-agent-based blocking.
Comment #41765646 not loaded
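A minimal sketch of the AS-blocking idea: collect the prefixes announced by the AS you want to block (from BGP or whois data, not shown here) and reject client IPs that fall inside them. The prefixes below are placeholder documentation ranges, not any particular company's.

```python
# Sketch of AS-level blocking: reject client IPs that fall inside prefixes
# announced by the AS you want to block. The prefix list is a placeholder
# (documentation ranges); in practice it would come from BGP/whois data.
import ipaddress

BLOCKED_PREFIXES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_PREFIXES)

# is_blocked("203.0.113.7") -> True, is_blocked("192.0.2.1") -> False
```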
wtk 8 months ago
https://archive.ph/https://fortune.com/2024/10/03/bytedance-tiktok-bytespider-scraper-bot/
kgen 8 months ago
To be honest, it's probably not enough to just block these scrapers if they're acting maliciously. People should start serving generated content back to them and see how long it takes for them to catch on and fix the problem.
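A toy sketch of that idea, assuming the scraper self-identifies by user agent (which, per the thread, it often will not): a hypothetical Flask handler that returns filler text to matching agents instead of the real page.

```python
# Toy sketch of serving filler text to self-identified crawlers instead of the
# real page. The Flask app, route, and user-agent tokens are all hypothetical.
import random
from flask import Flask, request

app = Flask(__name__)
CRAWLER_TOKENS = ("Bytespider", "GPTBot")  # assumed self-identifying UA substrings
WORDS = ["quantum", "synergy", "blockchain", "paradigm", "latency", "gradient"]

def looks_like_crawler(user_agent: str) -> bool:
    return any(token in user_agent for token in CRAWLER_TOKENS)

@app.route("/articles/<slug>")
def article(slug: str):
    if looks_like_crawler(request.headers.get("User-Agent", "")):
        # Cheap, plausible-looking nonsense; useless as training data.
        return " ".join(random.choices(WORDS, k=200))
    return f"Real article content for {slug}"  # stand-in for the real handler
```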
OuterVale 8 months ago
This comes to mind: https://youtube.com/watch?v=Hi5sd3WEh0c
nubinetwork 8 months ago
> The China-based parent company of video app TikTok released its own web crawler or scraper bot, dubbed Bytespider, sometime in April

Uh, no... Bytespider has been around for a long time...
OutOfHere 8 months ago
Just how would a scraper catch up with the internet if not by accelerating the rate? It is to be expected if the scraping is to succeed.
welder 8 months ago
So what, who cares? Is this newsworthy? It's definitely not something to get upset about; web scraping is a normal part of the internet.