
Wikipedia is struggling with voracious AI bot crawlers

91 points by bretpiatt, about 2 months ago

23 comments

diggan, about 2 months ago
This has to be one of the strangest targets to crawl, since they themselves make database dumps available for download (https://en.wikipedia.org/wiki/Wikipedia:Database_download) and, if that weren't enough, there are third-party dumps as well (https://library.kiwix.org/#lang=eng&category=wikipedia) that you could use if the official ones aren't good enough for some reason.

Why would you crawl the web interface when the data is so readily available in an even better format?
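To make the point concrete, here is a minimal sketch of consuming a dump offline instead of crawling the web UI. The dump URL follows the naming convention described on the linked Wikipedia:Database_download page, but it is an assumption here and should be verified; the full English dump is also a very large download.

```python
# Sketch: fetch one Wikipedia dump file and stream-parse it locally,
# replacing millions of page requests with a single download.
# The URL is assumed from the documented dump naming convention.
import bz2
import urllib.request
import xml.etree.ElementTree as ET

DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
LOCAL_FILE = "enwiki-latest-pages-articles.xml.bz2"

urllib.request.urlretrieve(DUMP_URL, LOCAL_FILE)  # one request, many gigabytes

# Stream-parse the compressed XML so the whole dump never sits in memory.
with bz2.open(LOCAL_FILE, "rb") as fh:
    for _event, elem in ET.iterparse(fh):
        if elem.tag.endswith("}page"):       # dump elements are namespaced
            title = elem.find("./{*}title")  # wildcard namespace (Python 3.8+)
            if title is not None:
                print(title.text)
            elem.clear()                     # free parsed elements as we go
```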
delichon, about 2 months ago
We're having the same trouble for a few hundred sites that we manage. It is no problem for crawlers that obey robots.txt since we ask for one visit per 10 seconds, which is manageable. The problem seems to be mostly the greedy bots that request as fast as we can reply. So my current plan is to set rate limiting for everyone, bots or not. But doing stats on the logs, it isn't easy to figure out a limit that won't bounce legit human visitors.

The bigger problem is that the LLMs are so good that their users no longer feel the need to visit these sites directly. It looks like the business model of most of our clients is becoming obsolete. My paycheck is downstream of that, and I don't see a fix for it.
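For what it's worth, the kind of rate limiting described above can be sketched as a per-client token bucket; the numbers below are placeholders rather than tuned values, and in practice this usually lives in the reverse proxy rather than application code.

```python
# Sketch: per-IP token-bucket rate limiting. Allows short human-like bursts
# while capping sustained request rates. RATE and BURST are illustrative.
import time
from collections import defaultdict

RATE = 1.0    # tokens refilled per second (~1 request/second sustained)
BURST = 10.0  # burst allowance so normal human browsing isn't bounced

_buckets = defaultdict(lambda: {"tokens": BURST, "stamp": time.monotonic()})

def allow(client_ip: str) -> bool:
    """Return True if this request should be served, False if throttled."""
    bucket = _buckets[client_ip]
    now = time.monotonic()
    elapsed = now - bucket["stamp"]
    bucket["tokens"] = min(BURST, bucket["tokens"] + elapsed * RATE)
    bucket["stamp"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False
```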
aucisson_masque, about 2 months ago
People have got to make bots pay. That's the only way to get rid of this worldwide DDoSing backed by multi-billion-dollar companies.

There are captchas to block bots, or at least make them pay money to solve them, and some people in the Linux community have also made tools to combat this; I think something that uses a little CPU energy.

And at the same time, you offer an API, less expensive than the cost of crawling, and everyone wins.

Multi-billion-dollar companies get their sweet, sweet data, Wikipedia gets money to enhance its infrastructure or whatever, and users benefit from Wikipedia-quality engagement.
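The "tools that use a little CPU energy" are presumably proof-of-work challenges. A toy sketch of the idea, with an illustrative difficulty value: the client burns roughly a million hashes to earn access while the server verifies with a single hash, which is cheap for one browser but expensive at crawler scale.

```python
# Sketch: hashcash-style proof of work. DIFFICULTY_BITS is illustrative.
import hashlib
import os

DIFFICULTY_BITS = 20  # ~2^20 (about a million) hash attempts on average

def make_challenge() -> str:
    """Server side: issue a random challenge."""
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    """Client side: find a nonce whose SHA-256 falls below the target."""
    target = 1 << (256 - DIFFICULTY_BITS)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: one hash to check what cost the client ~2^20 hashes."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))
```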
graemep, about 2 months ago
Wikipedia provides dumps. Probably cheaper and easier than crawling it. Given the size of Wikipedia, it would be well worth a little extra code. It also avoids the risk of getting blocked, and is more reliable.

It suggests to me that people running AI crawlers are throwing resources at the problem with little thought.
laz, about 2 months ago
10 years ago at Facebook we had a systems design interview question called "botnet crawl" where the setup that I'd give would be:

I'm an entrepreneur who is going to get rich selling printed copies of Wikipedia. I'll pay you to fetch the content for me to print. You get 1000 compromised machines to use. Crawl Wikipedia and give me the data. Go.

Some candidates would (rightfully) point out that the entirety is available as an archive, so for "interviewing purposes" we'd have to ignore that fact.

If it went well, you would pivot back and forth: OK, you wrote a distributed crawler. Wikipedia hires you to block it. What do you do? This cat-and-mouse game goes on indefinitely.
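A first whiteboard step for that question might be deterministic sharding, so the 1000 machines split the URL space without any coordinator holding global state; a rough sketch (names and shard count are illustrative):

```python
# Sketch: shard URLs across workers by hashing, so no page is fetched twice
# and each machine can decide ownership locally.
import hashlib

NUM_WORKERS = 1000

def worker_for(url: str) -> int:
    """Map a URL to exactly one worker."""
    digest = hashlib.sha256(url.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_WORKERS

def my_urls(frontier: list[str], worker_id: int) -> list[str]:
    """Each machine keeps only the slice of the crawl frontier it owns."""
    return [u for u in frontier if worker_for(u) == worker_id]
```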
chuckadams, about 2 months ago
We need to start cutting off whole ASNs of ISPs that host such crawlers and distribute a spamhaus-style block list to that effect. WP should throttle them to serve like one page per minute.
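A shared, spamhaus-style list would boil down to prefix checks like the sketch below; the CIDR ranges shown are documentation placeholders, not real crawler networks.

```python
# Sketch: check a client IP against a distributed blocklist of network
# prefixes (e.g. covering offending ASNs). Example prefixes are TEST-NET
# placeholders, not real ranges.
import ipaddress

BLOCKED_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in prefix for prefix in BLOCKED_PREFIXES)
```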
schneems, about 2 months ago
I was on a panel with the President of Wikimedia LLC at SXSW and this was brought up. There's audio attached: https://schedule.sxsw.com/2025/events/PP153044

I also like Anna's (Creative Commons) framing of the problem being money + attribution + reciprocity.
PeterStuer, about 2 months ago
The weird thing is their own data does *not* reflect this at all. The number of articles accessed by users, spiders, and bots alike has not moved significantly over the last few years. Why these strange wordings like "65 percent of the resource-consuming traffic"? Is there non-resource-consuming traffic? Is this just another fundraising marketing drive? Wikimedia has been known to be less than truthful with regard to its funding needs and spending.

https://stats.wikimedia.org/#/all-projects/reading/total-page-views/normal|bar|2020-02-01~2025-05-01|(access)~desktop*mobile-app*mobile-web+agent~user*spider*automated|monthly
perching_aix, about 2 months ago
I thought all of Wikipedia can be downloaded directly if that's the goal? [0] Why scrape?

[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download
wslh, about 2 months ago
Wouldn't downloading the publicly available Wikipedia database (e.g. via Torrent [1]) be enough for AI training purposes? I get that this doesn't actually stop AI bots, but captchas and other restrictions would undermine the open nature of Wikipedia.

[1] https://en.wikipedia.org/wiki/Wikipedia:Database_download
jerven, about 2 months ago
Working for an open-data project, I am starting to believe that the AI companies are basically criminal enterprises. If I did this kind of thing to them, they would call the cops and say I am a criminal for breaking TOS and doing a DDoS; therefore they are likely to be criminal organizations and their CEOs should be in Alcatraz.
shreyshnaccount, about 2 months ago
They will DDoS the open internet to the point where only Big Tech will be able to afford to host even the most basic websites? Is that the endgame?
insane_dreamer, about 2 months ago
> "expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement."

These multi-$B corps continue to leech off of everyone's labors, and no one seems able to stop them. At what level can entities take action? The courts? Legislation?

We've basically handed over the Internet to a cabal of Big Tech.
iJohnDoe, about 2 months ago
Not sure why these AI companies need to scrape and crawl. Just seems like a waste when companies like OpenAI have already done this.

Obviously, OpenAI won't share their dataset. It's part of their competitive stance.

I don't have a point or solution. However, it seems wasteful for non-experts to be gathering the same data and reinventing the wheel.
qwertox, about 2 months ago
Maybe the big tech providers should play fair and host the downloadable database for those bots as well as crawlable mirrors.
skydhash, about 2 months ago
The worst thing about that is that Wikipedia has dumps of all its data which you can download.
microtherion, about 2 months ago
Not just Wikipedia. My home server (hosting a number of not particularly noteworthy things, such as my personal gitea instance) has been absolutely hammered in recent months, to the extent of periodically bringing down the server for hours with thrashing.

The worst part is that every single sociopathic company in the world seems to have simultaneously unleashed their own fleet of crawlers.

Most of the bots downright ignore robots.txt, and some of the crawlers hit the site simultaneously from several IPs. I've been trying to lure the bots into a nepenthes tarpit, which somewhat helps, but ultimately find myself having to firewall entire IP ranges.
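For readers unfamiliar with the tarpit idea: the point is to dribble worthless responses out slowly, so a misbehaving crawler ties up its own connection while costing the server almost nothing. A toy sketch follows; deciding which clients count as misbehaving (robots.txt violations, request rate) is the hard part and is left out here.

```python
# Sketch: a toy HTTP tarpit that drips a meaningless page out one fragment
# per second. Port and timings are illustrative.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        for _ in range(60):  # keep the crawler busy for about a minute
            try:
                self.wfile.write(b"<p>loading...</p>\n")
                self.wfile.flush()
            except (BrokenPipeError, ConnectionResetError):
                break
            time.sleep(1)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), TarpitHandler).serve_forever()
```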
lambdaone, about 2 months ago
It's not just Wikipedia; the entire rest of the open-access web is suffering from them.

I think the most interesting thing here is that it shows that the companies doing these crawls simply don't care who they hurt, as they actively take measures to prevent their victims from stopping them by using multiple IP addresses, snowshoe crawling, evading fingerprinting, and so on.

For Wikipedia, there's a solution served up to them on a plate. But they simply can't be bothered to take it.

And this in turn shows the overall moral standards of those companies: it's the wild west out there, where the weak go to the wall, and those inflicting the damage know what they're doing, and just don't care. Sociopaths.
ddtaylor, about 2 months ago
IPFS
greenavocado, about 2 months ago
Careless crawler scum will put an end to the open Internet
amazingamazing, about 2 months ago
What's the best way to stop the bots? Cloudflare?
fareesh, about 2 months ago
"async_visit_link for link in links omg it works"
m101, about 2 months ago
Wikipedia spends 1% of its budget on hosting fees. It can spend a bit more, given the rest of its corruptions.