Crawling More Politely Than Big Tech

43 points by pkghost 5 months ago

9 comments

Mistletoe 5 months ago
Was all of our posting on the net on forums, HN, Reddit, Digg, Slashdot, etc. just to train the AI of the future? I think about this a lot. AI has that "annoying forum poster" tone to everything and now I can't unsee it when I (rarely) use it. Maybe I'm just post-internet. I've been thinking about that a lot also. I'm tired of 99.75% of the internet.
danpalmer 5 months ago
I've done crawling at a small startup and I've done crawling at a big tech company. This is not crawling more politely than big tech.

There are a few things that stand out, like:

> I fetch all robots.txts for given URLs in parallel inside the queue's enqueue function.

Could this end up DOS'ing or being "impolite" just in robots.txt requests?

All of this logic is per-domain, but nothing is mentioned about what constitutes a domain. If this is naive, it could easily end up overloading a server that uses wildcard subdomains to serve its content, like Substack having each blog on a separate subdomain.

When I was at a small startup doing crawling, the main thing our partners wanted from us was a maximum hit rate (which varied by partner). We typically promised fewer than 1 request per second, which would never cause perceptible load and was usually sufficient for our use case.

Here at $BigTech, the systems for ensuring "polite" and policy-compliant crawling (robots.txt etc.) are more extensive than I could possibly have imagined before coming here.

It doesn't surprise me that OpenAI and Amazon don't have great systems for this, both are new to the crawling world, but concluding that "Big Tech" doesn't do polite crawling is a bit of a stretch, given that search engines are most likely doing the best crawling available.
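A minimal sketch of what per-domain throttling keyed on the registered domain could look like, so wildcard subdomains such as Substack blogs share a single politeness budget. Python and the third-party `tldextract` package are assumptions here, as is the 1-request-per-second default; none of this is from the original post.

```python
# Illustrative sketch (not from the post): throttle per registered domain,
# so blog1.example.com and blog2.example.com share one politeness budget.
import threading
import time

import tldextract  # assumed dependency, used to group subdomains by eTLD+1


class DomainThrottle:
    def __init__(self, min_interval: float = 1.0):
        # At most one request per second per registered domain.
        self.min_interval = min_interval
        self.next_allowed: dict[str, float] = {}
        self.lock = threading.Lock()

    def _key(self, url: str) -> str:
        ext = tldextract.extract(url)
        return f"{ext.domain}.{ext.suffix}"  # "foo.substack.com" -> "substack.com"

    def wait(self, url: str) -> None:
        """Block until this URL's registered domain is allowed another hit."""
        key = self._key(url)
        with self.lock:
            now = time.monotonic()
            start = max(now, self.next_allowed.get(key, 0.0))
            self.next_allowed[key] = start + self.min_interval
            delay = start - now
        if delay > 0:
            time.sleep(delay)
```

Keying on the registered domain (eTLD+1) rather than the raw hostname is what keeps a naive crawler from hammering one server that sits behind thousands of subdomains.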
Aloisius 5 months ago
I think a default max of 1 request every 5 seconds is unnecessarily meek, especially for larger sites. I'd also argue that requests that browsers don't slow down for, like following redirects to the same domain or links with the prefetch attribute, don't really necessitate a delay at all.

If you can detect that a site has a CDN, metrics like time-to-first-byte are low and stable, and/or you're getting cache-control headers indicating you're mostly getting cached pages, I see no reason why one shouldn't speed up - at least for domains with millions of URLs.

I disagree with using HEAD requests for refreshing. A HEAD request is rarely cheaper, and for some websites more expensive, than a GET with If-Modified-Since/If-None-Match. Besides, you're going to fetch the page anyway if it changed, so why issue two requests when you could do one?

Having a single crawler per process/thread makes rate limiting easier, but it can lead to some balancing and under-utilization issues with distributed crawling due to the massive variation in URLs per domain and in site speeds, especially if you use something like a hash to distribute them. For Commoncrawl, I had something that monitored utilization and shut down crawler instances, which would redistribute pending URLs from the machines shutting down to the machines left (we were doing it on a shoestring budget using AWS spot instances, so it had to survive instances going down randomly anyway).

I'd say one of the most polite things to do when crawling is to add a URL to the crawler user agent pointing to a page explaining what it is, and maybe letting people opt out or explaining how to update their robots.txt to opt out.
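A hedged sketch of the conditional-GET refresh described above, assuming Python's `requests` library; the user-agent string and contact URL are placeholders, not details from the post.

```python
# Sketch: refresh with a conditional GET instead of a separate HEAD.
# A 304 costs one cheap round trip; a 200 already carries the new body.
import requests  # assumed dependency

# Placeholder identity with a contact URL, per the suggestion above.
USER_AGENT = "example-crawler/0.1 (+https://example.org/crawler-info)"


def refresh(url, etag=None, last_modified=None):
    headers = {"User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None, etag, last_modified  # unchanged since last fetch
    resp.raise_for_status()
    return resp.text, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```

The contact URL in the user agent follows the comment's last suggestion: give site owners a page that explains the crawler and how to opt out via robots.txt.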
registeredcorn 5 months ago
I'm just beginning to learn about curl and wget. Can anyone recommend similar resources to this one that emphasize politeness?

For example, I'd like to grab quite a few books from archive.org, but want to use their torrent option, when available. I don't like the idea of "slamming" their site because I'm trying to grab 400 books at once.
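For wget specifically, the stock politeness flags already cover much of this; an illustrative invocation (the delay, rate cap, and user-agent string are arbitrary placeholders):

```sh
# Illustrative only: pause between requests, add jitter, cap bandwidth,
# and identify yourself. Tune the numbers to the site's guidelines.
wget --wait=5 --random-wait --limit-rate=200k \
     --user-agent="example-fetcher/0.1 (contact: you@example.org)" \
     --input-file=urls.txt
```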
pkghost 5 months ago
A few implementation details from building a hobby crawler
ndriscoll 5 months ago
If you have cache headers, why use HEAD? Are servers more likely to handle HEAD correctly than including them on the GET?
thiago_fm 5 months ago
I doubt big tech cares much whether they're doing this to a website. They just want to fiercely battle the competition and make profits.
dsymonds 5 months ago
If the author reads this, you have a misspelling of "diaspora" in the first sentence.
Karupan 5 months ago
This is timely as I’m just building out a crawler in Scrapy. Thanks!