Improved ways to operate a rude crawler

75 points by doruk101, 2 months ago

7 comments

cobbzilla, 2 months ago
I had to lock down my private Gitea server when I noticed my commits were taking forever, because my meager 2-CPU instance was pegged.

Tail the nginx logs, and sure enough some jerk is asking for every URL for every git commit ever done, with no delays/backoffs/anything. Just hammering the ever-loving crap out of me. Lovely, GTFO!

The simplest thing to do: add HTTP Basic auth, so now my git server is no longer accessible to the public. Thanks AI startups! Maybe I'll re-enable it after this craze is over.
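For illustration only (not from the commenter): a minimal sketch of the "tail the nginx logs" step, counting requests per client to spot whoever is pegging the box. It assumes nginx's default "combined" log format, and the log path is a placeholder.

```python
import re
from collections import Counter

# nginx "combined" log format: ip - user [time] "request" status bytes "referer" "user-agent"
LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \S+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

def top_clients(log_path, n=10):
    """Count requests per (IP, User-Agent) pair and return the heaviest hitters."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINE.match(line)
            if m:
                counts[(m.group("ip"), m.group("ua"))] += 1
    return counts.most_common(n)

if __name__ == "__main__":
    for (ip, ua), hits in top_clients("/var/log/nginx/access.log"):  # path is an assumption
        print(f"{hits:8d}  {ip}  {ua[:60]}")
```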
jsheard, 2 months ago
Don't forget to set "Accept-Encoding" to "identity", you wouldn't want to waste valuable CPU cycles on decompression. You need those for training!
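To spell out the (satirical) tip: "identity" asks the server to send the response uncompressed, shifting bandwidth and CPU cost onto it. A minimal sketch with Python's requests library and a placeholder URL, shown purely as what not to do.

```python
import requests

# Accept-Encoding: identity refuses gzip/brotli, so the server sends raw bytes.
resp = requests.get(
    "https://example.com/",                   # placeholder URL
    headers={"Accept-Encoding": "identity"},  # make the server skip compression
    timeout=10,
)
print(resp.headers.get("Content-Encoding", "identity"), len(resp.content), "bytes")
```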
grayhatter, 2 months ago
> This text is satirical in nature.

I usually reject this kind of prefix/postfix as damaging to the spirit of the post. Ruining the art, as it were.

Unfortunately, I think in this case it's required. I run a novel git host and I've seen bots that lie about their UA and crawl exclusively code, ignoring the git history and only following links with an extension. It's a git host: if you choose to crawl the web interface instead of cloning the repo, you're too stupid to also pick up that this is satire, and would likely follow the other suggestions intentionally. The same goes for those bots that crawl Wikipedia instead of downloading one of the prepackaged archives. Bot authors: "Why are you the way you are?"

It's refreshing to read some humor about the state of things. There's too much frustration, vitriol and anger. Justified as it may be, this is a nice change of pace. So my heartfelt thanks to the author, I laughed. :)
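A hedged illustration of that point (not anything from the comment itself): a single bare clone transfers the entire history in one request, so there is no reason to walk a web UI commit by commit. The repository URL below is a placeholder.

```python
import subprocess

REPO = "https://example.com/someuser/somerepo.git"  # hypothetical repository URL

# One bare clone pulls every commit, tree, and blob in a single transfer.
subprocess.run(["git", "clone", "--bare", REPO, "mirror.git"], check=True)

# The full history is now available locally -- no per-commit HTTP requests needed.
log = subprocess.run(
    ["git", "--git-dir=mirror.git", "log", "--oneline", "--all"],
    check=True, capture_output=True, text=True,
)
print(f"{len(log.stdout.splitlines())} commits fetched in one go")
```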
renegat0x0, 2 months ago
As someone who runs a very simple crawler, I hope these actions will not affect me that much. I want to be able to collect data and to share it.

Results of my crawling: https://github.com/rumca-js/Internet-Places-Database
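The repository above holds the collected data; the commenter's crawler itself isn't shown, so the following is only a rough sketch of what a "very simple" but polite crawler might look like: an honest User-Agent, a robots.txt check, and a fixed delay between requests. The User-Agent string and contact address are made up.

```python
import time
import urllib.parse
import urllib.robotparser
import requests

USER_AGENT = "example-places-crawler/0.1 (contact: admin@example.com)"  # hypothetical UA

def polite_fetch(urls, delay=5.0):
    """Yield responses for URLs that robots.txt allows, pausing between requests."""
    robots = {}
    for url in urls:
        parts = urllib.parse.urlsplit(url)
        base = f"{parts.scheme}://{parts.netloc}"
        if base not in robots:
            rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
            try:
                rp.read()
            except OSError:
                pass  # unreachable robots.txt: can_fetch() stays conservative
            robots[base] = rp
        if not robots[base].can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt: skip instead of hammering
        yield requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        time.sleep(delay)  # back off between requests
```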
czk, 2 months ago
Turn the tables by having your crawler send snippy emails to webmasters when their site slows down under your barrage. Try: “Your server failed to support our cutting-edge AI training. Please upgrade your pathetic infrastructure.” Blaming them for your bad behavior not only shifts responsibility but also proves your startup’s fearless attitude.
mmsc, 2 months ago
These types of satirical posts are great, and it's great that they can not only be entertaining but also provide new information (I had never heard of TCP SACK).

P.S.: all I have to say to this guy spamming HN at the moment is mentioned in this (great) article: GET over it.
Comment #43446235 not loaded
Comment #43446011 not loaded
Comment #43446045 not loaded
acrophiliac, 2 months ago
"You want that new TCP handshake smell" really got to me.