So tries are a really cool data structure and a natural fit for autocomplete, but a while back I had to implement an autocomplete widget (over stock tickers), and it turns out that for many relatively small, dense data sets, binary search over a sorted array with a linear scan to collect the matches is a much (10x+) faster approach.<p>The problem is that a trie is a very cache-unfriendly data structure. Every node is connected to every other by pointers, and the nodes may be spread all over main memory. Finding the first match usually involves chasing several pointers; enumerating each successive one involves chasing several more. Meanwhile, if you just had a sorted array, all subsequent results would be on the same page at least, and many would be on the same cache line.<p>YMMV, based on your data's size and shape. Sorted arrays usually work best when the full data set is small and dense (eg. stock tickers), while tries may work better for sparse data (eg. sentences typed in by users). Always benchmark before committing to an implementation.
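To make the sorted-array approach concrete, here's a minimal sketch (names are my own, not from any particular implementation): binary search locates the first candidate, then a linear scan walks the contiguous run of matches.

```python
import bisect

def complete(sorted_words, prefix, k=10):
    """Return up to k completions of `prefix` from a sorted list of strings.

    bisect finds the first candidate in O(log n); the matches then sit
    in one contiguous (cache-friendly) region, so we just scan forward.
    """
    i = bisect.bisect_left(sorted_words, prefix)
    out = []
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        out.append(sorted_words[i])
        if len(out) == k:
            break
        i += 1
    return out

tickers = sorted(["AAPL", "AMZN", "GOOG", "GOOGL", "MSFT", "NFLX"])
print(complete(tickers, "GOO"))  # ['GOOG', 'GOOGL']
```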
When you have more than a handful of strings, you will want to return only a small number (k) of completions, preferably sorted by some pre-assigned score. Depending on how the trie is ordered, a naive traversal only gives you the first k completions in lexicographic order, which can be very far from the "best" completions.<p>A common solution is to store in each internal node the maximum score among all its descendants, and to sort the children of each node by that maximum score. This way, by doing a best-first search with a heap, you can efficiently retrieve the top-k completions by score.<p>Once you have millions of strings, it becomes tricky to do this in a space-efficient and cache-efficient way. I wrote a paper a few years ago exploring different space-time tradeoffs: <a href="http://www.di.unipi.it/~ottavian/files/topk_completion_www13.pdf" rel="nofollow">http://www.di.unipi.it/~ottavian/files/topk_completion_www13...</a>
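A minimal sketch of that best-first scheme (class and function names are my own, not from the paper): each node carries the max score of its subtree, which serves as an upper bound for the heap's priority ordering.

```python
import heapq
import itertools

class Node:
    __slots__ = ("children", "score", "max_score")
    def __init__(self):
        self.children = {}              # char -> Node
        self.score = None               # score if a string ends here
        self.max_score = float("-inf")  # best score anywhere in this subtree

def insert(root, word, score):
    node = root
    node.max_score = max(node.max_score, score)
    for ch in word:
        node = node.children.setdefault(ch, Node())
        node.max_score = max(node.max_score, score)
    node.score = score

def top_k(root, prefix, k):
    # Walk down to the node for the prefix.
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    # Best-first search: pop the entry with the highest upper bound.
    # The counter is a tiebreaker so Node objects are never compared.
    tie = itertools.count()
    heap = [(-node.max_score, next(tie), prefix, node)]
    out = []
    while heap and len(out) < k:
        _, _, word, n = heapq.heappop(heap)
        if n is None:                   # sentinel: `word` is a finished completion
            out.append(word)
            continue
        if n.score is not None:
            heapq.heappush(heap, (-n.score, next(tie), word, None))
        for ch, child in n.children.items():
            heapq.heappush(heap, (-child.max_score, next(tie), word + ch, child))
    return out

root = Node()
for w, s in [("app", 5), ("apple", 9), ("apply", 7), ("ape", 3)]:
    insert(root, w, s)
print(top_k(root, "ap", 2))  # ['apple', 'apply']
```

Because every heap entry's priority is an upper bound on the scores below it, a completion popped via the sentinel is guaranteed to outscore everything still in the heap, so completions come out in exact score order without exploring the whole subtree.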
You may also want to check out this spelling-correction algorithm; I'm using it alongside a trie. The original is in C#, but ports are available.<p><a href="http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/" rel="nofollow">http://blog.faroo.com/2012/06/07/improved-edit-distance-base...</a>
Speaking of autocomplete, how would you design a data structure and declaration syntax to feed to the autocomplete/parsing system of a constrained CLI?<p>By "constrained CLI", I mean the kind that controls a router or other piece of equipment with a well-defined (um, somewhat) syntax, as opposed to the more free-form grammar of a shell.<p>I ask because one of my jobs had me occasionally working on the CLI declarations for a type of router. The CLI commands could be in the form "interface foo enable", where "interface" could be shortened to its shortest non-ambiguous form ("int", IIRC), and "foo" was a string corresponding to one of the interfaces on the system. Pressing TAB after "int" would automatically suggest the list of interfaces that exist on the system. This command had an opposite in the form "no interface foo enable" (yeah, prefixing "no" instead of postfixing "disable", though to be fair the latter may have been possible as well). A declaration of such a syntax was done with something along the lines of:<p><pre><code> node(id=interface, type=keyword, keyword=interface, no_prefix=yes, next=int_name);
node(id=int_name, type=string, next=enable);
node(id=enable, type=keyword, keyword=enable);
</code></pre>
(with several extra bells and whistles to account for the int_name string referencing the existing interfaces on the system, and to indicate that the command was only complete with "enable" at the end).<p>There were hundreds of declaration files containing lines like that; they were parsed by a humongous Perl script that output a 7MB source file containing the data structures fed to libinput. It was easily the slowest part of our compilation process, and Emacs would choke on that 7MB source file.<p>It was terrible, but I never found examples of how to do it better. Anyone have any experience?
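For what it's worth, the shortest-non-ambiguous-prefix behavior described above boils down to something like this hypothetical sketch (Python, not the Perl pipeline described; names are mine):

```python
def match_keyword(token, keywords):
    """Resolve `token` against a command vocabulary by unique-prefix match.

    Returns the full keyword if the token is an exact keyword, or an
    unambiguous prefix of exactly one keyword; otherwise None.
    """
    if token in keywords:          # exact match always wins ("ip" vs "ipsec")
        return token
    hits = [kw for kw in keywords if kw.startswith(token)]
    return hits[0] if len(hits) == 1 else None

kws = ["interface", "ip", "enable", "exit"]
print(match_keyword("int", kws))   # 'interface'
print(match_keyword("i", kws))     # None (ambiguous: interface, ip)
print(match_keyword("en", kws))    # 'enable'
```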
It seems you'd get more mileage out of a trie if you also compared it with a sort of zipper approach. The idea is that if I type "a", I just focus the trie on the "a" subtree. If I next type "l", I focus on the "l" child of that new root, so I'm now looking at all strings starting "al". It doesn't change the complexity class of completion, but maybe the small optimisation is worth it.
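A small sketch of that incremental focusing (hypothetical names): instead of re-walking the trie from the root on every keystroke, keep a focus node and descend one child per character typed.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.terminal = False

def build(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root

def enumerate_from(node, prefix):
    """Yield all completions in the subtree under `node`."""
    if node.terminal:
        yield prefix
    for ch, child in sorted(node.children.items()):
        yield from enumerate_from(child, prefix + ch)

# The "zipper": descend one child per keystroke, keeping the focus node.
focus = build(["al", "alan", "alex", "bob"])
typed = ""
for ch in "al":
    focus = focus.children.get(ch)
    if focus is None:
        break
    typed += ch
print(list(enumerate_from(focus, typed)))  # ['al', 'alan', 'alex']
```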
I was watching this lecture last night about tries and suffix arrays. Pretty cool stuff!<p><a href="http://youtu.be/NinWEPPrkDQ" rel="nofollow">http://youtu.be/NinWEPPrkDQ</a>