I recently completed a small web crawl covering <i>only</i> pages linked from HN stories. For example, I would crawl the LWN article linked from https://news.ycombinator.com/item?id=32614633. I used archive.org's Wayback Machine to fetch the copy nearest to the HN submission's timestamp. If archive.org didn't have a copy, I fetched the page directly. The crawl comes to about 2.5 million pages.<p>I would like to publish it for others to use, but I'm not sure how useful it would be. In the HN spirit of validating customers early, I'd like to gauge interest <i>from those who would actually download and use such a resource</i> before moving forward. Let me know.
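<p>To make the fetch strategy above concrete, here is a minimal sketch of "nearest Wayback copy first, direct fetch as fallback". It is an illustration, not the crawler's actual code. It assumes the Wayback Machine availability endpoint at archive.org/wayback/available; the function names, the 30-second timeout, and the choice of the standard-library urllib are all hypothetical.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone
from typing import Optional, Tuple

# Assumed endpoint: the Wayback Machine availability API.
AVAILABILITY_API = "https://archive.org/wayback/available"


def wayback_timestamp(unix_time: int) -> str:
    """Format a Unix time (UTC) as a 14-digit YYYYMMDDhhmmss stamp."""
    moment = datetime.fromtimestamp(unix_time, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")


def availability_query(url: str, unix_time: int) -> str:
    """Build the lookup URL for the snapshot closest to unix_time."""
    params = urllib.parse.urlencode(
        {"url": url, "timestamp": wayback_timestamp(unix_time)}
    )
    return f"{AVAILABILITY_API}?{params}"


def closest_snapshot(api_response: dict) -> Optional[str]:
    """Return the closest snapshot URL, or None if nothing is archived."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_story_page(url: str, submitted_at: int) -> Tuple[str, bytes]:
    """Prefer the archived copy nearest the submission time, else fetch live."""
    try:
        lookup = json.loads(fetch(availability_query(url, submitted_at)))
        snapshot = closest_snapshot(lookup)
    except (OSError, ValueError):
        # Lookup failed or returned invalid JSON: treat as "no archived copy".
        snapshot = None
    if snapshot:
        return "wayback", fetch(snapshot)
    return "direct", fetch(url)
```

Calling fetch_story_page(story_url, submission_unix_time) returns a label saying which source was used along with the page bytes, so a published dataset could record provenance per page. A real crawl of this size would also need rate limiting, retries, and a policy for snapshots that are far from the submission date, none of which are shown here.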