I think the focus on algorithms and data storage is totally misguided. You don't need to count every single word on the Internet. This is a question about statistical inference:

(1) Use Zipf's law (or a more suitable distribution) to estimate how many times we expect the 10,000th word to occur in a random sample of documents. That gives us the maximum interval at which the 10,000th word will repeat itself, at whatever level of statistical confidence we choose (a rough sketch of this estimate is at the end of this comment).

(2) Don't count every word. Only track the top 10,000 words plus a tail of whatever size is necessary to cover this maximum interval. With Zipf's law that means fewer than 20,000 words; the exact number is easy enough to calculate.

(3) Store words in a data structure that provides fast lookup by spelling (a tree) and fast access by frequency of occurrence (a list). Bump words up the list as we run into them. When you run into a word that isn't in your tree, grab the entry at the bottom of the frequency list and reuse it as the slot for the new word, repositioning it as needed. There is no need to traverse much data (this counter is also sketched below).

(4) Lower-frequency content naturally drops off the list, higher-frequency content naturally moves up, and we've structured the program so that we can make statistical inferences about the words that are left.

(5) Remember to normalize incoming data with the Porter stemmer or something similar (example below). Lots of small details like that, but it's really just icing on the cake.

Result? You get a statistically sound estimate that you can rerun at whatever level of accuracy you want. The software doesn't need huge computational muscle or massive amounts of memory, since we throw out the long tail of statistically irrelevant content on an ongoing basis. That, coupled with the data structure - we're mostly swapping pointers - keeps the software small and fast.

Is there a better approach?
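
For step (1), a minimal sketch of the estimate, assuming a plain Zipf exponent of s = 1 and a made-up vocabulary of one million word types; the function names and parameters here are mine, purely for illustration:

    def zipf_prob(rank, vocab_size, s=1.0):
        # Probability of the word at `rank` under a Zipf distribution with
        # exponent s over vocab_size word types (assumed parameters).
        harmonic = sum(1.0 / k**s for k in range(1, vocab_size + 1))
        return (1.0 / rank**s) / harmonic

    def expected_gap(rank, vocab_size, s=1.0):
        # Expected number of tokens between occurrences of the rank-th word,
        # i.e. the interval the tail buffer in step (2) has to cover.
        return 1.0 / zipf_prob(rank, vocab_size, s)

    print(round(expected_gap(10_000, 1_000_000)))  # roughly 144,000 tokens

Under those assumptions the 10,000th word shows up about once every 144,000 tokens on average; to hit a given confidence level you would pad that by taking a high quantile of the gap distribution rather than its mean, and that padded interval is what sets the tail size in step (2).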
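
For steps (2)-(4), a minimal sketch of the bounded counter, with a plain dict standing in for the tree and a linear scan standing in for the frequency list; the class and method names are made up for illustration:

    class TopWordCounter:
        # Tracks at most `capacity` words; everything else falls off the bottom.

        def __init__(self, capacity):
            self.capacity = capacity
            self.counts = {}  # word -> count

        def add(self, word):
            if word in self.counts:
                self.counts[word] += 1        # bump an existing word up
            elif len(self.counts) < self.capacity:
                self.counts[word] = 1         # still room in the tail
            else:
                # Evict the current lowest-frequency word and reuse its slot
                # for the new word, as in step (3).
                loser = min(self.counts, key=self.counts.get)
                del self.counts[loser]
                self.counts[word] = 1

        def top(self, k=10_000):
            return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]

The linear scan in the eviction branch is only there to keep the sketch short; the tree-plus-frequency-list structure from (3), or the bucket list used by the Space-Saving algorithm (which this closely resembles), makes both lookup and eviction cheap, so you really are mostly swapping pointers. Space-Saving also seeds the new word with the evicted count plus one, which is what gives it provable error bounds on the reported frequencies.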
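
And for step (5), a quick normalization example, assuming NLTK's implementation of the Porter stemmer is available; the regex tokenizer is just a placeholder and real text needs more care:

    import re
    from nltk.stem import PorterStemmer  # assumes the NLTK package is installed

    _stemmer = PorterStemmer()
    _token_re = re.compile(r"[a-z']+")

    def normalize(text):
        # Lowercase, tokenize, and stem incoming text before it hits the counter.
        return [_stemmer.stem(tok) for tok in _token_re.findall(text.lower())]

    print(normalize("Counting counted counts"))  # ['count', 'count', 'count']

Whether you actually want stemming depends on how the question defines a "word", but that's the icing-on-the-cake part.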