The original article uses Java, which makes it harder to see how simple it can be to make and use prefix trees. For comparison, here's the Python code to create a prefix tree from a set of words:<p><pre><code> def prefix_tree(words):
     d = dict()
     for word in words:
         root = d
         for c in word:
             root = root.setdefault(c, {})
         root['$$'] = {}  # mark end of word with '$$' token
     return d
</code></pre>
For example:<p><pre><code> >>> prefix_tree('foo bar baz'.split())
 {'b': {'a': {'r': {'$$': {}},
              'z': {'$$': {}}}},
  'f': {'o': {'o': {'$$': {}}}}}
</code></pre>
To query whether a word is present in a tree you basically compute<p><pre><code> reduce(dict.__getitem__, word, tree)
</code></pre>
and test whether the result (a forest) has an end-of-word '$$' tree.
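Wrapped in a helper, a membership test might look like this (a minimal sketch building on the prefix_tree dict above; the contains name is just illustrative):<p><pre><code> from functools import reduce

 def contains(tree, word):
     # Walk the trie one character at a time; a missing edge means the
     # word (and any word with this prefix) is absent.
     try:
         node = reduce(dict.__getitem__, word, tree)
     except KeyError:
         return False
     return '$$' in node

 >>> t = prefix_tree('foo bar baz'.split())
 >>> contains(t, 'bar'), contains(t, 'ba')
 (True, False)
</code></pre>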
Tries (pronounced "trees"; the compressed variants are also known as Patricia trees) are very memory efficient compared to most alternatives, but really only shine in membership-query workloads. If you actually need to return the stored element, you have to rebuild it for every query, which is probably too expensive if you have lots of queries; it would be better to store the whole object contiguously somewhere in the data structure so you could return a reference to it.<p>Even then, tries only win if the key distribution is suitable to give you the memory efficiency you are hoping for. If you don't have many common prefixes, you're better off with just a hashtable.
A trie is a poor data structure in practice. There is no locality of reference that can be exploited: to look up a single key, almost the entire length of the memory blocks occupied by the trie needs to be accessed, and almost at random (depending on the insertion technique). For keys inserted later on, the number of different blocks of memory accessed for a single key is proportional to the length of the key.<p>A B-Tree is better due to better exploitation of locality of reference in memory.
Should mention that tries can be heavily compressed into DAWGs (Directed Acyclic Word Graphs) by eliminating common subtrees. I've used this in word games on iOS where I want a small memory footprint.<p>P.S. I love tries!
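P.P.S. For the curious, the subtree sharing can be sketched in a few lines on top of the dict-based trie from the first comment (a hedged sketch: to_dawg is an illustrative name, and real game dictionaries usually serialize the result into a flat node array rather than keeping Python dicts):<p><pre><code> def to_dawg(node, registry=None):
     # Share identical subtrees by interning each node on its shape:
     # canonicalize the children first (post-order), then look the node
     # up by its (edge label, child identity) signature.
     if registry is None:
         registry = {}
     canonical = {c: to_dawg(child, registry) for c, child in node.items()}
     key = tuple(sorted((c, id(child)) for c, child in canonical.items()))
     return registry.setdefault(key, canonical)
</code></pre>
After compressing the 'foo bar baz' trie above, the subtrees below 'r', 'z' and the second 'o' all become one shared object, and every leaf is the same shared empty dict.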
I think the article was well written. This is the standard example to use for a Trie. But I never run into any scenarios except for this that make me think, "A Trie is a good fit for this!" What are some other examples where Tries come in handy?
With a basic Trie, you always have this problem of how to keep the pointers at each node. Naively, you can:<p>1) Keep an array of size equal to the alphabet size, indexed by the characters, holding pointers to successive nodes.<p>- but this blows the memory up to O(N * alphabet_size) where N is the input size<p>2) Keep a linked list of pairs (character, pointer).<p>- but this blows the lookup time up to O(word_length * alphabet_size)<p>3) Keep something like an STL map<character, pointer>.<p>- still suboptimal complexity of O(word_length * log alphabet_size) + it complicates the code + it dramatically increases the constant factor per operation<p>Or you research a bit more and actually learn a way to properly implement a trie, called a Ternary Search Tree (a minimal sketch follows the link below):
<a href="http://en.wikipedia.org/wiki/Ternary_search_tree" rel="nofollow">http://en.wikipedia.org/wiki/Ternary_search_tree</a>
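A minimal Python sketch of a ternary search tree, following the lo/eq/hi node layout described on that page (TSTNode, tst_insert and tst_contains are just illustrative names):<p><pre><code> class TSTNode:
     def __init__(self, ch):
         self.ch = ch                         # the single character stored here
         self.lo = self.eq = self.hi = None   # lower / equal / higher children
         self.is_word = False                 # does a word end at this node?

 def tst_insert(node, word, i=0):
     # One character and three pointers per node, so memory stays
     # proportional to the total number of characters, independent
     # of the alphabet size.
     ch = word[i]
     if node is None:
         node = TSTNode(ch)
     if ch < node.ch:
         node.lo = tst_insert(node.lo, word, i)
     elif ch > node.ch:
         node.hi = tst_insert(node.hi, word, i)
     elif i + 1 < len(word):
         node.eq = tst_insert(node.eq, word, i + 1)
     else:
         node.is_word = True
     return node

 def tst_contains(node, word, i=0):
     while node is not None:
         ch = word[i]
         if ch < node.ch:
             node = node.lo
         elif ch > node.ch:
             node = node.hi
         elif i + 1 < len(word):
             node, i = node.eq, i + 1
         else:
             return node.is_word
     return False
</code></pre>
Lookup costs roughly O(word_length + log n) character comparisons on average, while keeping only three pointers per node.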
John Resig did a really cool series[1][2][3] of deep-dive posts into optimization in JS for mobile, which covered tries (among others) -- it was the first I'd heard of them. Fascinating read.<p>1. <a href="http://ejohn.org/blog/dictionary-lookups-in-javascript/" rel="nofollow">http://ejohn.org/blog/dictionary-lookups-in-javascript/</a><p>2. <a href="http://ejohn.org/blog/javascript-trie-performance-analysis/" rel="nofollow">http://ejohn.org/blog/javascript-trie-performance-analysis/</a><p>3. <a href="http://ejohn.org/blog/revised-javascript-dictionary-search/" rel="nofollow">http://ejohn.org/blog/revised-javascript-dictionary-search/</a>
If you like tries, take a look at the double-array trie: it's a pretty nifty (though somewhat complicated) way to greatly reduce the space usage of tries by interleaving nodes in the same block of memory. It also makes it easier to serialize a trie, so you can quickly load a pre-built one instead of re-doing all the inserts, which is useful for static dictionaries.<p><a href="http://linux.thai.net/~thep/datrie/" rel="nofollow">http://linux.thai.net/~thep/datrie/</a>
Tries haven't been entirely neglected. Quadtrees and octrees are their 2- and 3D analogues, and those are very common in games, renderers and other spatial simulations.
DAWGs (<a href="http://en.wikipedia.org/wiki/Directed_acyclic_word_graph" rel="nofollow">http://en.wikipedia.org/wiki/Directed_acyclic_word_graph</a>) are a special kind of Trie where identical child subtrees are merged into a single node. I modified and extended DAWGs and came up with a nifty data structure called ASSDAWG (Anagram Search Sorted DAWG). The way this works is whenever a string is inserted into the DAWG, its characters are bucket-sorted first and then it is inserted, and the leaf nodes hold an additional number indicating which permutations are valid if we reach that leaf node from the root. This has 2 nifty advantages:<p>Since I sort the strings before insertion and since DAWGs naturally collapse similar subtrees, I get a high level of compression (e.g. "eat", "ate", "tea" all become 1 path a-e-t with a list of numbers at the leaf node indicating which permutations of a-e-t are valid).
Searching for anagrams of a given string is super fast and trivial now as a path from root to leaf holds all the valid anagrams of that path at the leaf node using permutation-numbers.
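A minimal sketch of the sorted-key idea, using the plain dict-based trie from the first comment rather than a real DAWG, and storing whole words at the leaf instead of permutation numbers (illustrative names):<p><pre><code> def build_anagram_trie(words):
     # Insert every word under its sorted letters, so all anagrams share
     # one path; the leaf records which permutations actually occur.
     root = {}
     for word in words:
         node = root
         for c in sorted(word):
             node = node.setdefault(c, {})
         node.setdefault('$$', set()).add(word)
     return root

 def anagrams(root, word):
     node = root
     for c in sorted(word):
         if c not in node:
             return set()
         node = node[c]
     return node.get('$$', set())
</code></pre>
So anagrams(build_anagram_trie('eat ate tea tan'.split()), 'eta') walks the single a-e-t path and returns the set {'ate', 'eat', 'tea'}.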
I just used a Trie yesterday to implement some string matching that I needed to do in a tokenizer :) It's quite useful if you want to do string matching and you know ahead of time the exact list of strings you need to match with. My only issue with this article is that it explains them using Java, which I think just obscures the explanation too much, but I guess it's a Java blog.
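For instance, longest-match scanning against a fixed token list is only a few lines with the dict-based prefix_tree from the first comment (a rough sketch; longest_match is an illustrative name, not what my tokenizer actually uses):<p><pre><code> def longest_match(tree, text, start=0):
     # Follow trie edges from text[start:], remembering the last position
     # where a complete token ended ('$$' present); return the longest one.
     node, end = tree, None
     for i in range(start, len(text)):
         c = text[i]
         if c not in node:
             break
         node = node[c]
         if '$$' in node:
             end = i + 1
     return text[start:end] if end is not None else None

 tokens = prefix_tree(['<', '<=', '<<', '==', '='])
 longest_match(tokens, 'a <= b', 2)   # returns '<='
</code></pre>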
The <i>Dasher</i> input method (aimed at people with disabilities who are able to enter data only by manipulating a pointer) uses a visualized trie with vertex areas proportional to linguistic probability:<p><a href="http://www.inference.phy.cam.ac.uk/dasher/DasherSummary2.html" rel="nofollow">http://www.inference.phy.cam.ac.uk/dasher/DasherSummary2.htm...</a><p>(Prof. Mackay, its author, uses it to teach information theory and compression, among other things.)
Implementing the Trie (on paper, in C) was about 50% of my final exam on my intro to CS undergrad class at Stanford. We had never seen Tries before that point, but we had done a variety of trees, so we had the necessary prior experience. It was a fun exam and a very memorable experience.
Tries have hardly been neglected. "Path-compressed" tries were very recently added to FreeBSD[0] for use in the vm subsystem.<p>[0]: <a href="http://fxr.watson.org/fxr/source/sys/pctrie.h" rel="nofollow">http://fxr.watson.org/fxr/source/sys/pctrie.h</a>
Tries are great for caching URLs & other path-like data. I've found them to be faster (and smaller) than hashes for these purposes in the past.<p>However, this was back in the days of tiny caches. I suspect they may have lost the edge they had in 2003.
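Roughly the shape I mean, keying the trie on path segments rather than characters (a hedged sketch with illustrative names, reusing the dict convention from the first comment):<p><pre><code> def cache_insert(root, path, value):
     # One trie level per path segment, so /static/js/app.js and
     # /static/js/lib.js share the /static/js chain of nodes.
     node = root
     for segment in path.strip('/').split('/'):
         node = node.setdefault(segment, {})
     node['$$'] = value   # cached entry for this exact path
     return root

 def cache_lookup(root, path):
     node = root
     for segment in path.strip('/').split('/'):
         if segment not in node:
             return None
         node = node[segment]
     return node.get('$$')
</code></pre>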