Perversely, this submission is essentially blogspam. The article linked in the second paragraph, to which this "1 minute" read adds almost nothing of value, is the main story:

<https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/>

394 comments. 645 points. Submitted 3 hours ago:

<https://news.ycombinator.com/item?id=43422413>
I might be naive, but I think it's time we seriously start implementing "HTTP status code 402: Payment Required" across the board.

"L402" is an interesting proposal. Paying a fraction of a penny per request.

https://github.com/l402-protocol/l402
There's a real economic problem here: when someone scrapes your site, you're literally paying for them to use your stuff. That's messed up (and not sustainable).

It seems like a good fit for micropayments. They never took off with people, but machines may be better suited to them.

L402 can help here.

https://l402.org
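The other half, hedged the same way: a rough sketch of a polite client that notices the 402 challenge instead of retrying blindly. The header parsing assumes the same LSAT-style format as the server sketch above; the actual payment step depends on your wallet or provider, so it is left as a stub.

```python
# Rough sketch of the client side: a crawler that honours a 402 challenge
# instead of hammering the site. The pay-and-retry step is a stub because
# it depends on whatever Lightning wallet/provider is used.
import urllib.request
import urllib.error

def fetch(url: str) -> bytes:
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code != 402:
            raise
        challenge = err.headers.get("WWW-Authenticate", "")
        # e.g. 'L402 macaroon="...", invoice="..."' -- pay the invoice,
        # then retry with 'Authorization: L402 <macaroon>:<preimage>'.
        print("Payment required:", challenge)
        raise NotImplementedError("plug in a wallet to pay and retry")

if __name__ == "__main__":
    fetch("http://localhost:8402/")
```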
Rate limiting is the first step before cutting everything off behind forced logins.

> This practice started with larger websites, ones that already had protection from malicious usage like denial-of-service and abuse in the form of services like Cloudflare or Fastly

FYI Cloudflare has a very usable free tier that’s easy to set up. It’s not limited to large websites.
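If you want to roll your own before reaching for Cloudflare, the core idea is just a token bucket per client. A minimal in-process sketch in Python, assuming you key on the client IP; in production you would key on something sturdier and keep the state in your reverse proxy or in Redis rather than in memory.

```python
# Minimal per-client token bucket: refill at RATE tokens/second up to
# BURST, spend one token per request, reject with 429 when empty.
import time
from collections import defaultdict

RATE = 1.0      # tokens added per second
BURST = 10.0    # maximum bucket size

_buckets = defaultdict(lambda: (BURST, time.monotonic()))  # client_id -> (tokens, last_seen)

def allow(client_id: str) -> bool:
    tokens, last = _buckets[client_id]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last request
    if tokens < 1.0:
        _buckets[client_id] = (tokens, now)
        return False  # over the limit: respond with 429
    _buckets[client_id] = (tokens - 1.0, now)
    return True
```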
Linked in the article that this article links to is a project I found interesting for combatting this problem, a (non-crypto) proof-of-work challenge for new visitors: https://github.com/TecharoHQ/anubis

Looks like the GNOME GitLab instance implements it: https://gitlab.gnome.org/GNOME
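Not Anubis's actual scheme (read its source for that), but the general shape of a proof-of-work interstitial is easy to sketch: the server issues a random challenge, and the visitor's browser must find a nonce whose hash clears a difficulty target before it gets a session cookie. Cheap for one human visit, expensive at crawler scale. Assumptions here: SHA-256 and a leading-zero-bits target.

```python
# Generic proof-of-work sketch: find a nonce so that
# sha256(challenge + nonce) has DIFFICULTY leading zero bits.
import hashlib
import os

DIFFICULTY = 20  # leading zero bits required (~1M hashes on average)

def _ok(challenge: bytes, nonce: int) -> bool:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

def solve(challenge: bytes) -> int:
    # Client-side work: brute-force a nonce that clears the target.
    nonce = 0
    while not _ok(challenge, nonce):
        nonce += 1
    return nonce

def verify(challenge: bytes, nonce: int) -> bool:
    # Server-side check: a single hash, regardless of difficulty.
    return _ok(challenge, nonce)

if __name__ == "__main__":
    challenge = os.urandom(16)
    nonce = solve(challenge)
    assert verify(challenge, nonce)
```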
We should try separating good bots from bad bots:

Good bots: search engine crawlers that help users find relevant information. These bots have been around since the early days of the internet and generally follow established best practices like robots.txt and rate limits. AI agents like OpenAI's Operator or Anthropic's Computer Use probably also fit into that bucket, as they offer useful automation without negative side effects.

Bad bots: bots that negatively affect website owners by causing higher costs, spam, or downtime (automated account creation, ad fraud, or DDoS). AI crawlers fit into that bucket as they disregard robots.txt and spoof their user agents. They are creating a lot of headaches for developers responsible for maintaining heavily crawled sites. AI companies don't seem to care about any of the crawling best practices that the industry has developed over the past two decades.

So the actual question is how good bots and humans can coexist on the web while we protect websites against abusive AI crawlers. It currently feels like an arms race without a winner.
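For contrast, this is roughly what the "good bot" side of the ledger looks like in practice: check robots.txt before fetching anything and honour the answer. A short sketch using Python's standard library; the user agent string and URLs are placeholders.

```python
# What a well-behaved crawler does before fetching: consult robots.txt.
import urllib.robotparser

USER_AGENT = "ExampleCrawler/1.0"  # placeholder user agent

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch(USER_AGENT, "https://example.com/some/page"):
    print("allowed: fetch politely, with rate limits")
else:
    print("disallowed by robots.txt: skip")
```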
> How long until scrapers start hammering Mastodon servers?

Mastodon has AUTHORIZED_FETCH and DISALLOW_UNAUTHENTICATED_API_ACCESS, which would at least stop these very naive scrapers from getting any data. Smarter scrapers could actually pretend to speak enough ActivityPub to scrape servers, though.
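For anyone running an instance: both settings named above are environment variables, and in a standard Mastodon deployment they go in .env.production (restart the services after changing them).

```
# .env.production
AUTHORIZED_FETCH=true
DISALLOW_UNAUTHENTICATED_API_ACCESS=true
```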
I would think all you need to do is add a copyright statement of some kind.

Sad that things are getting to this point. Maybe I should add this to my site :)

(c) Copyright (my email). If used for any form of LLM processing, you must contact me and pay 1000 USD per word from my site for each use.
Crawlers visiting every page on your website is not the main problem with the unauthenticated web.

The amount of spam that happens when you let people freely post is a much bigger problem.
To be honest, I feel that web2 is overrated.

Most content, like blogs, could be static sites.

For Mastodon and forums, I think user validation is OK and a good way to go.
Could an answer here be for smaller websites to convert their sites into chatbots, which could prevent AI scrapers from slurping up all their content and driving up their hosting costs?
> I suggest everyone that uses cloud infrastructure for hosting set up a billing limit to avoid an unexpected bill in case they're caught in the cross-hairs of a negligent company. All the abusers anonymize their usage at this point, so good luck trying to get compensated for damages.

This is scary.
Pretty soon virtually everything will be paywalled.
Ironically, it will provide us with a good metric that lets us find out whether AGI has arrived or not: when it does, paywalling will stop working, because AGI could derive more value from accessing things and will thus outbid us.
Everyone is (rightfully) outraged, but this is essentially nothing new. Asshat capitalists have been externalizing the costs of their asshat moneymaking schemes onto the little guy since approximately forever.

Deregulation is ultimately antithetical to our personal freedom.

I just hope the spirit of the internet that I grew up with can be rescued, or reincarnated somehow...
Yet another entry in the long and shameful history of Silicon Valley abusing the public square for its own profit (or, in this case, fantasies of profit), while the rest of us just have to learn to live with it because the justice system simply will not even try to give us recourse.

"Move fast and break things" apparently comes with a bonus clause: the things you break are not your responsibility to fix.
For some reason I am not really moved by a lot of the hand-wringing I am seeing lately.

It's not a binary thing to me: LLMs are not god, but even without AGI, they have proven wildly useful to me. Calling them "shitty chat bots" doesn't sway me.

Further, I have always assumed that everything I post to the web is publicly accessible to everyone and everything. We lost any battle we thought we could wage some 2+ decades ago, when web crawlers started hoovering up data from our sites.