I’ve been puttering away at making a search engine of my own (I should really do a Show HN sometime); let’s see how my experience compares with 18 years ago:

Bandwidth: This is now also cheap; my residential service is 1 Gbit. However, the suggestion to wait until you’ve got indexing working well before optimizing crawling is IMO still spot-on; trying to make a polite, performant crawler that can deal with all the bizarre edge cases on the Web (https://memex.marginalia.nu/log/32-bot-apologetics.gmi) will drag you down. (I bypassed this problem by starting with the Stack Exchange data dumps and Wikipedia crawls, which are a lot more consistent than random websites.)

CPU: Computers are *really* fast now; I’m using a 2-core computer from 2014 and it does what I need just fine.

Disk: SATA is the new thing now, of course, but the difference these days is HDD vs SSD. SSDs are faster, but you can design your architecture so that this mostly doesn’t matter, and even a “slow” HDD will be running at capacity. (The trick is to do linear streaming as much as possible and avoid seeks at all costs.) Still, it’s probably a good idea to store your production index on an SSD, and it’s useful for intermediate data as well; by happenstance more than design I have a large HDD and a small SSD, and they balance each other nicely.

Storing files: 100% agree with this section, for the disk-seek reasons I mention above. Also, pages from the same website often compress very well against each other (since they’re built from the same templates, large chunks of HTML can be squished down considerably), so if you’re pressed for space, consider storing one gzipped file per domain (sketch below). (The tradeoff with compression is that you can’t seek to an arbitrary page without reading everything before it, but ideally you’ve designed things so you don’t need to do that anyway.) Also, WARC is a standard file format with a lot of existing tooling for this exact use case.

Networking: I skipped this by just storing everything on one computer; I expect to be able to keep doing that for a long time, since vertical scaling can get you *very* far these days.

Indexing: You basically don’t need to write *anything* to get started with this these days! I’m just using bog-standard Elasticsearch with some glue code to do html2text (sketch below); it’s working fine and took all of an afternoon to set up from scratch. (That said, I’m not sure I’ll *continue* using Elastic: it has a ton of features I don’t need, which makes it hard to understand and work with, since so much of it is irrelevant to me. I’m probably going to switch to either straight Lucene or Bleve soon.)

Page rank: I added PageRank very early on in the hopes that it would improve my results, but I’m not really sure how much it helps if your results aren’t decent to begin with. However, the march of Moore’s law has made it an easy experiment: what Page and Brin’s server could compute in a week with carefully optimized C code, mine can do in less than 5 minutes (!) with a bit of JavaScript (sketch below).

Serving: Again, Elasticsearch will solve this entire problem for you (at least to start with); all your frontend has to do is take the JSON result and poke it into an HTML template (sketch below).

It’s easier than ever to start building a search engine in your own home; the recent explosion of such services (as seen on HN) shows it’s feasible, and the rising complaints about Google show that the demand is there.
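Since the per-domain compression trick is easy to get wrong, here’s a minimal sketch of it. This isn’t my production code: the `store/` directory layout and the length-prefixed record format are just inventions for illustration, but they show why the scheme gives you linear streaming with no seeks:

```python
# Sketch: one gzip file per domain, pages appended as length-prefixed
# records. Same-template pages compress well against each other, and
# reading back is a single linear scan -- no seeking.
import gzip
from pathlib import Path

STORE = Path("store")  # hypothetical storage root

def append_page(domain: str, url: str, html: bytes) -> None:
    """Append one page to the domain's gzip file."""
    STORE.mkdir(exist_ok=True)
    # Appending starts a new gzip member; the gzip module reads
    # concatenated members back transparently.
    with gzip.open(STORE / f"{domain}.gz", "ab") as f:
        f.write(url.encode() + b"\n")
        f.write(b"%d\n" % len(html))
        f.write(html)

def read_pages(domain: str):
    """Stream every (url, html) pair back out in one linear read."""
    with gzip.open(STORE / f"{domain}.gz", "rb") as f:
        while url := f.readline():
            length = int(f.readline())
            yield url.decode().strip(), f.read(length)
```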
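For the curious, the Elasticsearch glue really is tiny. This isn’t my exact code (the “pages” index name and document shape are made up, and it assumes a local Elasticsearch plus the html2text package), but it’s the general shape of the afternoon’s work:

```python
# Sketch: html2text + Elasticsearch glue. Assumes Elasticsearch on
# localhost and `pip install elasticsearch html2text`; the "pages"
# index name is hypothetical.
import html2text
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_page(url: str, html: str) -> None:
    text = html2text.html2text(html)  # strip markup down to readable text
    # Use the URL as the document id so re-crawled pages overwrite themselves.
    es.index(index="pages", id=url, document={"url": url, "text": text})
```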
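And to back up the PageRank claim: naive power iteration is only a screenful of code. Mine is in JavaScript; here’s the same idea sketched in Python (the dict-of-out-links graph format is just for illustration):

```python
# Sketch: power-iteration PageRank. Assumes every linked-to page also
# appears as a key in `links`; damping and iteration count are the
# usual textbook defaults.
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iters):
        new = {page: (1.0 - damping) / n for page in links}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for out in outs:
                    new[out] += share
            else:
                # Dangling page: spread its rank evenly over everyone.
                for other in new:
                    new[other] += damping * rank[page] / n
        rank = new
    return rank

# Toy graph: three pages linking at each other.
print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```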
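Likewise on the serving side, the whole “frontend” can start out as one query plus string formatting. A sketch, reusing the `es` client and the hypothetical “pages” index from the indexing sketch above:

```python
# Sketch: one search request, poked into an HTML list.
def search_page(query: str, size: int = 10) -> str:
    resp = es.search(index="pages", query={"match": {"text": query}}, size=size)
    items = "".join(
        f'<li><a href="{h["_source"]["url"]}">{h["_source"]["url"]}</a></li>'
        for h in resp["hits"]["hits"]
    )
    return f"<ul>{items}</ul>"
```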
Come and join us, the water’s fine!