> "we never had to deal with sharding the index across multiple servers or a hash table of words that wouldn’t fit in memory. Speaking of which, I’d love to read some papers on how internet-scale search engines actually do that. Does anyone have any recommendations?"<p>I can speak from experience on a very large search engine (not on a google scale in # of docs, but within an order of magnitude - and google scale in terms of qps [estimated - google doesn't publish such numbers])<p>Re: "sharding the index across multiple servers" - every document has an id, mod the id against some number (preferably much larger than the number of partitions/shards you have), split you index servers into N clusters, assign mods to a cluster (do so in a way that you avoid hotspots), have a "query aggregator" that sends an incoming query to one server in every partition. the aggregator then merges the result sets and resorts based on a sort key passed by the search node.<p>Re: "hash table of words that wouldn’t fit in memory" - the vocabulary I had to work with included at least 7 (human) languages with _many_ artificial words. The # of hash entries tended to hover around 2.7M tokens. How, do not include numbers in the index (there's an infinite number of them :)), ignore case, and tokenization. Tokenization is relatively easy except for CJK languages for that either have fluent/native speakers define the tokenizing semantics or find/buy a library.