I had to lock down my private Gitea server when I noticed my commits were taking forever because my meager 2-CPU instance was pegged.<p>I tailed the nginx logs and, sure enough, some jerk was requesting every URL for every git commit ever made, with no delays, no backoff, nothing. Just hammering the ever-loving crap out of me. Lovely, GTFO!<p>The simplest thing to do was add HTTP Basic auth, so now my git server is no longer accessible to the public. Thanks, AI startups! Maybe I'll re-enable it after this craze is over.
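For anyone wanting to run the same kind of log check, here is a rough Python sketch that counts requests per client address. It assumes nginx's default "combined" log format; the helper name `top_clients` and the log path in the usage note are illustrative assumptions, not anything taken from the comment above.

```python
import re
from collections import Counter

# Assumption: nginx "combined" format, where each line starts with
# the client address, two fields, a [timestamp], then the quoted request.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*"')

def top_clients(lines, n=5):
    """Return the n client addresses with the most requests (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # silently skip lines that do not look like access-log entries
            counts[match.group(1)] += 1
    return counts.most_common(n)
```

It could be pointed at a log file with something like `top_clients(open("/var/log/nginx/access.log"))`, though that path varies by setup. A single address far ahead of the rest is the usual sign of a crawler with no rate limiting.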
Don't forget to set "Accept-Encoding" to "identity". You wouldn't want to waste valuable CPU cycles on decompression. You need those for training!
> This text is satirical in nature.<p>I usually reject this kind of prefix/postfix as damaging to the spirit of the post. Ruining the art, as it were.<p>Unfortunately, I think in this case it's required. I run a novel git host, and I've seen bots that lie about their UA and crawl code exclusively, ignoring the git history and following only links with a file extension. It's a git host; if you choose to crawl the web interface instead of cloning the repo, you're too stupid to pick up that this is satire, and would likely follow the other suggestions intentionally. Same goes for those bots that crawl Wikipedia instead of downloading one of the prepackaged archives. Bot authors: "Why are you the way you are?"<p>It's refreshing to read some humor about the state of things. There's too much frustration, vitriol, and anger. Justified as it may be, this is a nice change of pace. So my heartfelt thanks to the author. I laughed. :)
As someone who runs a very simple crawler, I hope these actions will not affect me that much. I want to be able to collect data and share it.<p>Results of my crawling:<p><a href="https://github.com/rumca-js/Internet-Places-Database" rel="nofollow">https://github.com/rumca-js/Internet-Places-Database</a>
Turn the tables by having your crawler send snippy emails to webmasters when their site slows down under your barrage. Try: “Your server failed to support our cutting-edge AI training. Please upgrade your pathetic infrastructure.” Blaming them for your bad behavior not only shifts responsibility but also proves your startup’s fearless attitude.
These types of satirical posts are great, and it's great that they can be not only entertaining but also informative (I had never heard of TCP SACK).<p>P.S. All I have to say to this guy spamming HN at the moment is mentioned in this (great) article: GET over it.