
Minimal Perfect Hashing

76 points by fendrak, almost 11 years ago

5 comments

danieldk, almost 11 years ago
The article takes a very weird turn: "If we need extra information about the words, we can use an additional data structure along with the MA-FSA. We can store it in a hash table."

This is not necessary at all, since an acyclic finite state automaton can be used as a perfect hash function. If you sum the right language cardinalities of the transitions followed, you get a unique hash code (1..n, where n is the number of strings in the automaton). For efficiency, you can precompute the cardinalities in the transitions or states. You can then use a normal array for whatever data you want to associate with a word.

With respect to parts 1 & 2 of the series, one can construct a Levenshtein automaton for a given word in linear time [1]. Such an automaton matches any string within the given edit distance of the word. You can then compute the intersection (language) of the Levenshtein automaton and the dictionary automaton to get all words within the given edit distance.

(Another approach would be to make the dictionary a transducer that returns the words within the given edit distance. I tried this once, but the resulting automaton would be too large for realistic dictionaries.)

I wrote a Java library for dictionary automata, perfect hash automata, and Levenshtein automata. It also provides immutable Map containers for String keys that store the strings in an automaton and use a flat array to store values:

https://github.com/danieldk/dictomaton

[1] http://csi.ufs.ac.za/resres/files/Schultz.pdf
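The ranking scheme described above can be sketched in a few lines. This is not code from dictomaton; it is a minimal Python illustration using a plain trie (the same ranking works on a minimized MA-FSA with counts stored per state or transition), assuming lexicographic sibling order and 0-based indices rather than the 1..n in the comment.

```python
# Rank a word by summing, along its path, the word counts of subtrees
# hanging off earlier siblings. A trie is used for clarity; a minimized
# automaton would share suffixes but rank identically.
class Node:
    def __init__(self):
        self.children = {}    # char -> Node
        self.is_word = False
        self.count = 0        # words in this subtree (precomputed)

def build(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.is_word = True

    def precompute(node):
        node.count = int(node.is_word) + sum(
            precompute(c) for c in node.children.values())
        return node.count

    precompute(root)
    return root

def perfect_hash(root, word):
    """Unique index in 0..n-1 for a stored word, else None."""
    index, node = 0, root
    for ch in word:
        if node.is_word:                      # a shorter stored word ranks first
            index += 1
        for c in sorted(node.children):
            if c == ch:
                break
            index += node.children[c].count   # skip all words in earlier branches
        else:
            return None                       # char not in the automaton
        node = node.children[ch]
    return index if node.is_word else None
```

Because the indices are dense, a flat array suffices to hold whatever values you want to associate with the words, exactly as the comment argues: no separate hash table needed.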
afsina, almost 11 years ago
For what it's worth, I have a quite fast MPHF implementation that uses an average of 3.2 bits per key:

https://github.com/ahmetaa/zemberek-nlp/blob/master/core/src/main/java/zemberek/core/hash/MultiLevelMphf.java

It is used for language model compression. There is also a Dart implementation as part of a small library:

https://github.com/ahmetaa/poppy
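To make the "few bits per key" idea concrete, here is a toy "hash and displace" construction, the idea behind CHD-style MPHFs. This is not afsina's implementation; it is a simplified sketch. Keys are bucketed by a first-level hash, then each bucket gets a seed that lands its keys in still-free slots. Real libraries compress the seed array to reach the 2–4 bits per key mentioned above; this sketch just stores plain integers.

```python
import hashlib

def h(seed, key, n):
    """Deterministic seeded hash into 0..n-1."""
    d = hashlib.blake2b(f"{seed}|{key}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % n

def build_mphf(keys):
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h(0, k, n)].append(k)
    seeds = [0] * n
    taken = [False] * n
    # Place the largest buckets first: they are the hardest to fit.
    for i in sorted(range(n), key=lambda i: -len(buckets[i])):
        if not buckets[i]:
            continue
        seed = 1
        while True:
            slots = [h(seed, k, n) for k in buckets[i]]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        for s in slots:
            taken[s] = True
        seeds[i] = seed
    return seeds

def mphf(seeds, key):
    """Map a known key to a unique slot in 0..n-1."""
    n = len(seeds)
    return h(seeds[h(0, key, n)], key, n)
```

Note that, like any MPHF, this only answers correctly for keys that were in the build set; unknown keys map to arbitrary slots.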
TheCoreh, almost 11 years ago
I had the luck to have Fabiano Botelho — the person behind the Minimal Perfect Hash thesis — as an Algorithms and Data Structures professor back when I was studying Computer Engineering at CEFET-MG. He is absolutely brilliant, and has some incredible knowledge of both the theoretical and practical aspects of using hash tables.

IIRC, he had just finished publishing his thesis, and had each of us write our own implementations of minimal perfect hashing for extra credit. The data structure is really useful for indexing large, immutable data sets.
thomasahle, almost 11 years ago
For what it's worth, if you have on the order of N things to store and you hash them into a space of size N^3, the probability of a collision is less than N^(-1): basically nonexistent. Hence, if you want to save space, just use a normal hash table implementation, but store a more precise hash instead of the keys. This is very useful when your keys are document-sized.
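A minimal sketch of that trick, with illustrative names of my own choosing: key the table by a wide fingerprint of each document rather than the document itself. By the birthday bound, n keys hashed into at least n^3 values collide with probability around n^2 / (2 * 2^bits), which is below 1/n once bits >= 3*log2(n); 128 bits is far more than enough for any realistic n.

```python
import hashlib

def fingerprint(doc: str, bits: int = 128) -> int:
    """Wide hash of a (possibly huge) document, truncated to `bits` bits."""
    digest = hashlib.sha256(doc.encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

# An ordinary dict, but keyed by small fingerprints instead of the
# document-sized keys themselves.
docs = ["a very long document ...", "another very long document ..."]
index = {fingerprint(d): i for i, d in enumerate(docs)}
```

The table behaves exactly like one keyed by the documents, while each key costs 16 bytes regardless of document size.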
freeqaz, almost 11 years ago
Change that dang font!