
Minimal Perfect Hashing

76 points by fendrak, almost 11 years ago

5 comments

danieldk, almost 11 years ago
The article takes a very weird turn: "If we need extra information about the words, we can use an additional data structure along with the MA-FSA. We can store it in a hash table."

This is not necessary at all, since an acyclic finite state automaton can be used as a perfect hash function. If you sum the right language cardinalities of the transitions followed, you get a unique hash code (1..n, where n is the number of strings in the automaton). For efficiency, you can precompute the cardinalities in the transitions or states. You can then use a normal array for whatever data you want to associate with a word.

With respect to parts 1 & 2 of the series, one can construct a Levenshtein automaton for a given word in linear time [1]. Such an automaton matches any string within the given edit distance of the word. You can then compute the intersection (language) of the Levenshtein automaton and the dictionary automaton to get all words within the given edit distance.

(Another approach would be to make the dictionary a transducer that returns the words within the given edit distance. I tried this once, but the resulting automaton would be too large for realistic dictionaries.)

I wrote a Java library for dictionary automata, perfect hash automata, and Levenshtein automata. It also provides immutable Map containers for String keys that store the strings in an automaton and use a flat array to store values:

https://github.com/danieldk/dictomaton

[1] http://csi.ufs.ac.za/resres/files/Schultz.pdf
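The ranking scheme described above can be sketched in a few lines. This is not code from dictomaton; it is a minimal Python illustration using a plain trie (the same ranking works on a minimized MA-FSA with counts stored per state or transition), assuming lexicographic sibling order and 0-based indices rather than the 1..n in the comment.

```python
# Rank a word by summing, along its path, the word counts of subtrees
# hanging off earlier siblings. A trie is used for clarity; a minimized
# automaton would share suffixes but rank identically.
class Node:
    def __init__(self):
        self.children = {}    # char -> Node
        self.is_word = False
        self.count = 0        # words in this subtree (precomputed)

def build(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.is_word = True

    def precompute(node):
        node.count = int(node.is_word) + sum(
            precompute(c) for c in node.children.values())
        return node.count

    precompute(root)
    return root

def perfect_hash(root, word):
    """Unique index in 0..n-1 for a stored word, else None."""
    index, node = 0, root
    for ch in word:
        if node.is_word:                      # a shorter stored word ranks first
            index += 1
        for c in sorted(node.children):
            if c == ch:
                break
            index += node.children[c].count   # skip all words in earlier branches
        else:
            return None                       # char not in the automaton
        node = node.children[ch]
    return index if node.is_word else None
```

Because the indices are dense, a flat array suffices to hold whatever values you want to associate with the words, exactly as the comment argues: no separate hash table needed.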
afsina, almost 11 years ago
For what it's worth, I have a quite fast MPHF implementation that uses an average of 3.2 bits per key:

https://github.com/ahmetaa/zemberek-nlp/blob/master/core/src/main/java/zemberek/core/hash/MultiLevelMphf.java

It is used for language model compression. There is also a Dart implementation as part of a small library:

https://github.com/ahmetaa/poppy
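To make the "few bits per key" idea concrete, here is a toy "hash and displace" construction, the idea behind CHD-style MPHFs. This is not afsina's implementation; it is a simplified sketch. Keys are bucketed by a first-level hash, then each bucket gets a seed that lands its keys in still-free slots. Real libraries compress the seed array to reach the 2–4 bits per key mentioned above; this sketch just stores plain integers.

```python
import hashlib

def h(seed, key, n):
    """Deterministic seeded hash into 0..n-1."""
    d = hashlib.blake2b(f"{seed}|{key}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % n

def build_mphf(keys):
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h(0, k, n)].append(k)
    seeds = [0] * n
    taken = [False] * n
    # Place the largest buckets first: they are the hardest to fit.
    for i in sorted(range(n), key=lambda i: -len(buckets[i])):
        if not buckets[i]:
            continue
        seed = 1
        while True:
            slots = [h(seed, k, n) for k in buckets[i]]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        for s in slots:
            taken[s] = True
        seeds[i] = seed
    return seeds

def mphf(seeds, key):
    """Map a known key to a unique slot in 0..n-1."""
    n = len(seeds)
    return h(seeds[h(0, key, n)], key, n)
```

Note that, like any MPHF, this only answers correctly for keys that were in the build set; unknown keys map to arbitrary slots.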
TheCoreh, almost 11 years ago
I had the luck to have Fabiano Botelho — the person behind the Minimal Perfect Hash thesis — as an Algorithms and Data Structures professor back when I was studying Computer Engineering at CEFET-MG. He is absolutely brilliant, and has some incredible knowledge of both the theoretical and practical aspects of using hash tables.

IIRC, he had just finished publishing his thesis, and had each of us write our own implementations of minimal perfect hashing for extra credit. The data structure is really useful for indexing large, immutable data sets.
thomasahle, almost 11 years ago
For what it's worth, if you have on the order of N things to store and you hash them into a space of size N^3, the probability of a collision is less than N^(-1): basically nonexistent. Hence, if you want to save space, just use a normal hash table implementation, but store a more precise hash instead of the keys. This is very useful when your keys are document-sized.
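A minimal sketch of that trick, with illustrative names of my own choosing: key the table by a wide fingerprint of each document rather than the document itself. By the birthday bound, n keys hashed into at least n^3 values collide with probability around n^2 / (2 * 2^bits), which is below 1/n once bits >= 3*log2(n); 128 bits is far more than enough for any realistic n.

```python
import hashlib

def fingerprint(doc: str, bits: int = 128) -> int:
    """Wide hash of a (possibly huge) document, truncated to `bits` bits."""
    digest = hashlib.sha256(doc.encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

# An ordinary dict, but keyed by small fingerprints instead of the
# document-sized keys themselves.
docs = ["a very long document ...", "another very long document ..."]
index = {fingerprint(d): i for i, d in enumerate(docs)}
```

The table behaves exactly like one keyed by the documents, while each key costs 16 bytes regardless of document size.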
freeqaz, almost 11 years ago
Change that dang font!