I've done crawling at a small startup and I've done crawling at a big tech company. This is not crawling more politely than big tech.

There are a few things that stand out, like:

> I fetch all robots.txts for given URLs in parallel inside the queue's enqueue function.

Could this end up DoS'ing a site, or at least being "impolite", just in robots.txt requests? Caching robots.txt per host so it's fetched once, not once per enqueued URL, avoids that (first sketch below).

All of this logic is per-domain, but nothing is mentioned about what constitutes a domain. If this is naive, it could easily end up overloading a server that uses wildcard subdomains to serve its content, like Substack having each blog on a separate subdomain. Keying politeness on the registrable domain rather than the full hostname addresses this (second sketch below).

When I was at a small startup doing crawling, the main thing our partners wanted from us was a maximum hit rate (it varied by partner). We typically promised fewer than 1 request per second, which would never cause perceptible load and was usually sufficient for our use case. A simple per-domain throttle is enough to honor that kind of promise (third sketch below).

Here at $BigTech, the systems for ensuring polite, policy-compliant crawling (robots.txt, etc.) are more extensive than I could possibly have imagined before coming here.

It doesn't surprise me that OpenAI and Amazon don't have great systems for this; both are new to the crawling world. But concluding that "Big Tech" doesn't do polite crawling is a bit of a stretch, given that search engines are most likely doing the best crawling available.
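To make the robots.txt point concrete, here's a minimal sketch (my own, not the article's code) of caching robots.txt per host, so that enqueueing many URLs from one host triggers at most one robots.txt fetch instead of one parallel fetch per URL. The `can_fetch` helper and the "mycrawler" user agent are made up for the example; it only uses Python's standard urllib.robotparser.

```python
import urllib.robotparser
from urllib.parse import urlsplit

_robots_cache = {}  # hostname -> parsed robots.txt for that host

def can_fetch(url: str, user_agent: str = "mycrawler") -> bool:
    """Check robots.txt, fetching it at most once per host."""
    host = urlsplit(url).hostname or ""
    parser = _robots_cache.get(host)
    if parser is None:
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"https://{host}/robots.txt")
        parser.read()  # a real crawler would rate-limit this fetch and handle errors
        _robots_cache[host] = parser
    return parser.can_fetch(user_agent, url)
```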
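On "what constitutes a domain": a rough sketch of keying politeness state on the registrable domain (eTLD+1) instead of the full hostname, so that wildcard subdomains all share one budget. It assumes the third-party tldextract package (which uses the Public Suffix List); the `politeness_key` name and the example hostnames are mine.

```python
import tldextract  # third-party; resolves suffixes via the Public Suffix List

def politeness_key(url: str) -> str:
    """Group hosts by registrable domain (eTLD+1) for rate limiting."""
    ext = tldextract.extract(url)
    # e.g. "someblog.example.com" and "other.example.com" both map to "example.com"
    return ext.registered_domain or ext.domain

print(politeness_key("https://someblog.example.com/post"))  # example.com
print(politeness_key("https://news.example.co.uk/story"))   # example.co.uk
```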
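And for the maximum hit rate, a per-domain throttle along the lines of what we promised partners: block until at least one second has passed since the last request to that domain. `PerDomainThrottle` and `MIN_INTERVAL` are hypothetical names; this is a sketch under that one-request-per-second assumption, not production code.

```python
import time
import threading
from urllib.parse import urlsplit

MIN_INTERVAL = 1.0  # seconds between requests to the same domain (< 1 req/s)

class PerDomainThrottle:
    def __init__(self, min_interval: float = MIN_INTERVAL):
        self.min_interval = min_interval
        self._next_slot = {}            # domain -> earliest allowed request time
        self._lock = threading.Lock()

    def wait(self, url: str) -> None:
        """Block until it is polite to hit this URL's domain again."""
        domain = urlsplit(url).hostname or ""
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_slot.get(domain, 0.0) - now)
            # Reserve the next slot before sleeping so concurrent callers queue up.
            self._next_slot[domain] = now + delay + self.min_interval
        if delay > 0:
            time.sleep(delay)

throttle = PerDomainThrottle()
throttle.wait("https://example.com/page1")  # returns immediately
throttle.wait("https://example.com/page2")  # sleeps ~1s before returning
```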