> It’s you who chooses what sites we crawl

Yeah, but you still reserve the right to *not* crawl sites (or to remove them from your index), yes? So there's still the *opportunity* to do evil.

I'm still waiting for a "raw" search spidering provider. One that:

1. runs a web-spidering cluster — one that's only smart enough to know what robots.txt is, to know how to follow links in HTML pages, and to obey response caching-policy headers;

2. captures the spidering process losslessly, as e.g. HAR transcript files;

3. packs those HAR transcript files, a few million at a time, into tar.xz.tar files (i.e. grab a "chunk" of N HAR files; group them into subdirs by request Host header; archive each subdir, and compress those archives independently; then archive all the compressed archives together without compression) — and then uploads these semi-random-access archives to a CDN or private BitTorrent tracker (or any other data-delivery system that enables clients to retrieve only the blocks/byte-ranges of the files they're interested in). (A rough packing sketch follows at the end of this comment.)

4. generates a TOC for the semi-random-access files, as a stream of tuples (signed archive URL, chunk byte-range, hostname, compressed URL list), and pushes these to a managed, reliable message queue on an IaaS, publishing each entry to both an all-hostnames topic and a per-hostname topic. (I say an IaaS, as this allows consumers to set up their own consumer-groups on these topics within their own IaaS project, and then pay the costs of message retention in those consumer-groups themselves.) (The TOC-entry shape, and the byte-range fetch it enables, are also sketched below.)

5. also buffers these TOC-entry streams into files (e.g. Parquet files), one archive series per topic, and hosts these alongside the HAR archives. Prune TOC topic-stream entries once they are at least N days old AND have been successfully "offlined" into a hosted TOC-stream archive.

---

This "web-spidering-firehose data-lake as-a-Service" architecture would enable pretty much anyone to build whatever arbitrary search *index* they want downstream of it, containing as much or as little of the web as they want — where each consumer only needs to do as much work as is required to fetch and parse the HARs of the domains they've decided they care about indexing.

This architecture would also be "temporal" (akin to a temporal RDBMS table) — as a consumer of this service, you wouldn't see "the current version" of a scraped URL, but rather *all previous attempts to scrape that URL, and what happened each time*. (This means that no website could ever retroactively censor the dataset by adding a robots.txt "Disallow *" *after* scrapes have already happened. Their robots.txt config would prevent *further* scraping, but *previous* scrapes would be retained.)

And in fact, in this architecture, the HTTP interaction *to retrieve /robots.txt for a domain* would produce a HAR transcript that gets archived like any other. Domains restricted from crawling by robots.txt would still get regular HAR transcripts recorded *of the result of checking that their /robots.txt still restricts crawling*. (Reducing over these /robots.txt HAR transcripts is how a consumer-indexer would decide whether it should currently be showing or hiding a domain in its built index; a minimal version of that reduce is sketched below.)
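---

To make step 3 concrete, here's a minimal packing sketch, assuming the HARs for one chunk are already in memory and keyed by "hostname/transcript-id.har" (that keying and the function name are mine; the rest is just the stdlib tarfile/lzma modules):

    # Sketch of the step-3 packing: one uncompressed outer tar whose members are
    # independently-xz'd per-host tars, so consumers can byte-range a single host.
    import io
    import lzma
    import tarfile
    from collections import defaultdict

    def pack_chunk(har_files: dict[str, bytes]) -> bytes:
        """har_files maps 'hostname/transcript-id.har' -> raw HAR bytes."""
        by_host: dict[str, dict[str, bytes]] = defaultdict(dict)
        for path, blob in har_files.items():
            host, _, name = path.partition("/")
            by_host[host][name] = blob

        outer_buf = io.BytesIO()
        with tarfile.open(fileobj=outer_buf, mode="w") as outer:  # outer tar: no compression
            for host, members in sorted(by_host.items()):
                inner_buf = io.BytesIO()
                with tarfile.open(fileobj=inner_buf, mode="w") as inner:
                    for name, blob in sorted(members.items()):
                        info = tarfile.TarInfo(name=f"{host}/{name}")
                        info.size = len(blob)
                        inner.addfile(info, io.BytesIO(blob))
                compressed = lzma.compress(inner_buf.getvalue())  # independent .xz per host
                info = tarfile.TarInfo(name=f"{host}.tar.xz")
                info.size = len(compressed)
                outer.addfile(info, io.BytesIO(compressed))
        return outer_buf.getvalue()

Because the outer tar is written without compression, each per-host .tar.xz member sits at a fixed, recordable byte offset inside it; that (offset, length) pair is exactly the "chunk byte-range" field the TOC in step 4 would publish.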
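And here's what a step-4 TOC entry, and the consumer-side fetch it enables, might look like. The field names, the URL scheme, and the assumption that the archive host honors HTTP Range requests are all mine; publishing each entry to both an all-hostnames topic and a per-hostname topic (say, har-toc.all and har-toc.<hostname>) is then just two produce calls on whatever queue technology you picked:

    # Sketch of a step-4 TOC entry plus the consumer-side fetch it enables.
    # Field names and the hosting/URL scheme are assumptions, not a real spec.
    import io
    import lzma
    import tarfile
    import urllib.request
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TocEntry:
        archive_url: str              # signed URL of the outer (uncompressed) tar
        byte_range: tuple[int, int]   # (offset, length) of this host's .tar.xz member
        hostname: str
        url_list_xz: bytes            # xz-compressed, newline-separated URL list

    def fetch_host_transcripts(entry: TocEntry) -> dict[str, bytes]:
        """Range-fetch one host's .tar.xz out of the outer tar and unpack its HARs."""
        start, length = entry.byte_range
        req = urllib.request.Request(
            entry.archive_url,
            headers={"Range": f"bytes={start}-{start + length - 1}"},
        )
        with urllib.request.urlopen(req) as resp:
            member = resp.read()      # just this host's compressed subarchive
        inner = tarfile.open(fileobj=io.BytesIO(lzma.decompress(member)), mode="r:")
        return {m.name: inner.extractfile(m).read() for m in inner.getmembers() if m.isfile()}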
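Finally, a minimal sketch of the /robots.txt reduce from the last paragraph: fold a domain's /robots.txt HAR transcripts, oldest first, into a single "show this domain right now?" flag. The way I poke at the HAR structure here is simplified, and the permission check is delegated to the stdlib robots.txt parser rather than hand-rolled:

    # Sketch of the consumer-side reduce over a domain's /robots.txt HAR transcripts.
    # HAR field access is simplified; the Disallow logic is the stdlib parser's.
    from urllib.robotparser import RobotFileParser

    def domain_currently_visible(robots_hars: list[dict]) -> bool:
        """robots_hars: parsed HAR dicts for past fetches of /robots.txt, oldest first.
        Returns whether the latest meaningful fetch permits crawling '/'. Older
        transcripts stay in the data lake regardless; this only drives show/hide."""
        visible = True  # never saw a robots.txt -> crawlable by default
        for har in robots_hars:
            entry = har["log"]["entries"][0]
            status = entry["response"]["status"]
            if status == 404:
                visible = True        # explicit "no robots.txt here"
            elif status == 200:
                body = entry["response"]["content"].get("text", "")
                parser = RobotFileParser()
                parser.parse(body.splitlines())
                visible = parser.can_fetch("*", "/")
            # other statuses (5xx etc.) leave the last known state unchanged
        return visible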