TechEcho

Improved ways to operate a rude crawler

75 points by doruk101 · 2 months ago

7 comments

cobbzilla · 2 months ago
I had to lock down my private Gitea server when I noticed my commits were taking forever, because my meager 2-CPU instance was pegged.

Tail the nginx logs, and sure enough, some jerk is asking for every URL for every git commit ever made, with no delays, no backoffs, nothing. Just hammering the ever-loving crap out of me. Lovely, GTFO!

The simplest thing to do: add HTTP Basic auth. Now my git server is no longer accessible to the public. Thanks, AI startups! Maybe I'll re-enable it after this craze is over.
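The lockdown cobbzilla describes amounts to a few lines of nginx. A minimal sketch, assuming nginx already proxies to a Gitea instance on its default web port 3000; the hostname and credential path are placeholders:

```nginx
# Create the credential file first:  htpasswd -c /etc/nginx/.htpasswd someuser
server {
    listen 443 ssl;               # serve over TLS so credentials aren't sent in the clear
    server_name git.example.com;  # hypothetical hostname

    location / {
        auth_basic           "private";  # any unauthenticated request now gets a 401
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:3000;  # Gitea's default web port
        proxy_set_header     Host $host;
    }
}
```

This challenges every request before it reaches Gitea, so a crawler that ignores robots.txt never gets to enumerate commit URLs at all.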
jsheard · 2 months ago
Don't forget to set "Accept-Encoding" to "identity"; you wouldn't want to waste valuable CPU cycles on decompression. You need those for training!
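For readers unfamiliar with the header jsheard is joking about: `Accept-Encoding: identity` asks the server to send the response body uncompressed, so the server pays in bandwidth instead of the client paying in CPU. A small sketch with Python's requests library (the URL is a placeholder):

```python
import requests

# Preparing (without sending) the request shows exactly what goes on the
# wire: this crawler advertises that it only accepts uncompressed bodies.
req = requests.Request(
    "GET",
    "https://example.com/some/page",
    headers={"Accept-Encoding": "identity"},
).prepare()

print(req.headers["Accept-Encoding"])  # identity
```

By default requests advertises gzip/deflate support, so overriding the header is a deliberate act, which is exactly the satirical point.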
grayhatter · 2 months ago
> This text is satirical in nature.

I usually reject this kind of prefix/postfix as damaging to the spirit of the post. Ruining the art, as it were.

Unfortunately, I think in this case it's required. I run a novel git host and I've seen bots that lie about their UA and crawl exclusively code, ignoring the git history and only following links with an extension. It's a git host: if you choose to crawl the web interface instead of cloning the repo, you're too stupid to pick up that this is satire, and would likely follow the other suggestions intentionally. The same goes for those bots that crawl Wikipedia instead of downloading one of the prepackaged archives. Bot authors: "Why are you the way you are?"

It's refreshing to read some humor about the state of things. There's too much frustration, vitriol, and anger. Justified as it may be, this is a nice change of pace. So my heartfelt thanks to the author; I laughed. :)
renegat0x0 · 2 months ago
As someone who runs a very simple crawler, I hope these actions will not affect me that much. I want to be able to collect data and to share it.

Results of my crawling: https://github.com/rumca-js/Internet-Places-Database
czk · 2 months ago
Turn the tables by having your crawler send snippy emails to webmasters when their site slows down under your barrage. Try: “Your server failed to support our cutting-edge AI training. Please upgrade your pathetic infrastructure.” Blaming them for your bad behavior not only shifts responsibility but also proves your startup’s fearless attitude.
mmsc · 2 months ago
These types of satirical posts are great, and it's great that they can not only be entertaining but also provide new information (I had never heard of TCP SACK).

P.S.: all I have to say to this guy spamming HN at the moment is covered in this (great) article: GET over it.
acrophiliac · 2 months ago
"You want that new TCP handshake smell" really got to me.