Crawling More Politely Than Big Tech

43 points by pkghost 5 months ago

9 comments

Mistletoe 5 months ago
Was all of our posting on the net on forums, HN, Reddit, Digg, Slashdot, etc. just to train the AI of the future? I think about this a lot. AI has that "annoying forum poster" tone to everything and now I can't unsee it when I (rarely) use it. Maybe I'm just post-internet. I've been thinking about that a lot also. I'm tired of 99.75% of the internet.
danpalmer 5 months ago
I've done crawling at a small startup and I've done crawling at a big tech company. This is not crawling more politely than big tech.

There are a few things that stand out, like:

> I fetch all robots.txts for given URLs in parallel inside the queue's enqueue function.

Could this end up DOS'ing or being "impolite" just in robots.txt requests?

All of this logic is per-domain, but nothing is mentioned about what constitutes a domain. If this is naive, it could easily end up overloading a server that uses wildcard subdomains to serve its content, like Substack having each blog on a separate subdomain.

When I was at a small startup doing crawling, the main thing our partners wanted from us was a maximum hit rate (which varied by partner). We typically promised fewer than 1 request per second, which would never cause perceptible load and was usually sufficient for our use case.

Here at $BigTech, the systems for ensuring "polite" and policy-compliant crawling (robots.txt etc.) are more extensive than I could possibly have imagined before coming here.

It doesn't surprise me that OpenAI and Amazon don't have great systems for this, both are new to the crawling world, but concluding that "Big Tech" doesn't do polite crawling is a bit of a stretch, given that search engines are most likely doing the best crawling available.
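A minimal sketch of what per-domain throttling keyed on the registered domain could look like, so wildcard subdomains such as Substack blogs share a single politeness budget. Python and the third-party `tldextract` package are assumptions here, as is the 1-request-per-second default; none of this is from the original post.

```python
# Illustrative sketch (not from the post): throttle per registered domain,
# so blog1.example.com and blog2.example.com share one politeness budget.
import threading
import time

import tldextract  # assumed dependency, used to group subdomains by eTLD+1


class DomainThrottle:
    def __init__(self, min_interval: float = 1.0):
        # At most one request per second per registered domain.
        self.min_interval = min_interval
        self.next_allowed: dict[str, float] = {}
        self.lock = threading.Lock()

    def _key(self, url: str) -> str:
        ext = tldextract.extract(url)
        return f"{ext.domain}.{ext.suffix}"  # "foo.substack.com" -> "substack.com"

    def wait(self, url: str) -> None:
        """Block until this URL's registered domain is allowed another hit."""
        key = self._key(url)
        with self.lock:
            now = time.monotonic()
            start = max(now, self.next_allowed.get(key, 0.0))
            self.next_allowed[key] = start + self.min_interval
            delay = start - now
        if delay > 0:
            time.sleep(delay)
```

Keying on the registered domain (eTLD+1) rather than the raw hostname is what keeps a naive crawler from hammering one server that sits behind thousands of subdomains.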
Aloisius 5 months ago
I think a default max of 1 request every 5 seconds is unnecessarily meek, especially for larger sites. I'd also argue that requests that browsers don't slow down for, like following redirects to the same domain or links with the prefetch attribute, don't really necessitate a delay at all.

If you can detect that a site has a CDN, metrics like time-to-first-byte are low and stable, and/or you're getting cache-control headers indicating you're mostly getting cached pages, I see no reason why one shouldn't speed up - at least for domains with millions of URLs.

I disagree with using HEAD requests for refreshing. A HEAD request is rarely cheaper, and for some websites more expensive, than a GET with If-Modified-Since/If-None-Match. Besides, you're going to fetch the page anyway if it changed, so why issue two requests when you could do one?

Having a single crawler per process/thread makes rate limiting easier, but it can lead to some balancing and under-utilization issues with distributed crawling due to the massive variation in URLs per domain and in site speeds, especially if you use something like a hash to distribute them. For Commoncrawl, I had something that monitored utilization and shut down crawler instances, which would redistribute pending URLs from the machines shutting down to the machines left (we were doing it on a shoestring budget using AWS spot instances, so it had to survive instances going down randomly anyway).

I'd say one of the most polite things to do when crawling is to add a URL to the crawler user agent pointing to a page explaining what it is, and maybe letting people opt out or explaining how to update their robots.txt to opt out.
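A hedged sketch of the conditional-GET refresh described above, assuming Python's `requests` library; the user-agent string and contact URL are placeholders, not details from the post.

```python
# Sketch: refresh with a conditional GET instead of a separate HEAD.
# A 304 costs one cheap round trip; a 200 already carries the new body.
import requests  # assumed dependency

# Placeholder identity with a contact URL, per the suggestion above.
USER_AGENT = "example-crawler/0.1 (+https://example.org/crawler-info)"


def refresh(url, etag=None, last_modified=None):
    headers = {"User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None, etag, last_modified  # unchanged since last fetch
    resp.raise_for_status()
    return resp.text, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```

The contact URL in the user agent follows the comment's last suggestion: give site owners a page that explains the crawler and how to opt out via robots.txt.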
registeredcorn 5 months ago
I'm just beginning to learn about curl and wget. Can anyone recommend similar resources to this one that emphasize politeness?

For example, I'd like to grab quite a few books from archive.org, but want to use their torrent option, when available. I don't like the idea of "slamming" their site because I'm trying to grab 400 books at once.
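For wget specifically, the stock politeness flags already cover much of this; an illustrative invocation (the delay, rate cap, and user-agent string are arbitrary placeholders):

```sh
# Illustrative only: pause between requests, add jitter, cap bandwidth,
# and identify yourself. Tune the numbers to the site's guidelines.
wget --wait=5 --random-wait --limit-rate=200k \
     --user-agent="example-fetcher/0.1 (contact: you@example.org)" \
     --input-file=urls.txt
```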
pkghost 5 months ago
A few implementation details from building a hobby crawler
ndriscoll 5 months ago
If you have cache headers, why use HEAD? Are servers more likely to handle HEAD correctly than including them on the GET?
thiago_fm 5 months ago
I doubt big tech cares much whether they're doing this to a website. They just want to fiercely battle the competition and make profits.
dsymonds 5 months ago
If the author reads this, you have a misspelling of "diaspora" in the first sentence.
Karupan 5 months ago
This is timely as I’m just building out a crawler in Scrapy. Thanks!