TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Show HN: A portable hash map in C

176 点作者 e-dant5 个月前

17 条评论

gritzko5 个月前
I recently coded a linear-probing hash map in C and I highly recommend you to use fuzz tests. These evolve naturally from unit tests: next step table unit tests, next step property test, next step fuzzing, all steps are incremental, hence easy.<p>Having a fuzz test was in-va-lu-ab-le. I caught several bugs right there. The delete function was particularly tricky.<p>I ended up with two fuzz tests as my hash table has a key feature: convergence. Having same contents, it would have exactly same bits in the buffer. In other words, it is independent of the insertion&#x2F;deletion order. For this, I added another fuzz test. I would add a third one if I realize there is an important invariant I did not fuzz test. That is not much work, but so much useful!<p><a href="https:&#x2F;&#x2F;github.com&#x2F;gritzko&#x2F;librdx&#x2F;blob&#x2F;master&#x2F;abc&#x2F;fuzz&#x2F;HASH.cpp">https:&#x2F;&#x2F;github.com&#x2F;gritzko&#x2F;librdx&#x2F;blob&#x2F;master&#x2F;abc&#x2F;fuzz&#x2F;HASH....</a><p><a href="https:&#x2F;&#x2F;github.com&#x2F;gritzko&#x2F;librdx&#x2F;blob&#x2F;master&#x2F;abc&#x2F;fuzz&#x2F;HASHd.c">https:&#x2F;&#x2F;github.com&#x2F;gritzko&#x2F;librdx&#x2F;blob&#x2F;master&#x2F;abc&#x2F;fuzz&#x2F;HASHd...</a><p>P.S. In your case, I recommend clang&#x27;s libfuzzer with ASAN.<p><a href="https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;LibFuzzer.html#fuzzer-usage" rel="nofollow">https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;LibFuzzer.html#fuzzer-usage</a>
评论 #42364808 未加载
评论 #42362138 未加载
drfuchs5 个月前
A bug, I believe: If you &quot;put&quot; three colliding strings A and then B and then C, and then &quot;delete&quot; B, you won&#x27;t be able to find C anymore.
评论 #42360455 未加载
eqvinox5 个月前
Here&#x27;s some food-for-thought questions:<p>First, a correctness question:<p>- if a struct contains padding, the value of these padding bytes is undefined. What happens if you use such a struct as a key?<p>And then some integration questions:<p>- how do you make this typesafe against mixing up 2 distinct uses of hashmaps in the application?<p>- how do you make this typesafe for its keys and values, i.e. remove accident-prone casts on read access?<p>- how do you not call the compare and hash functions through function pointers? (Calling through function pointers is both performance relevant as well as a target for exploitation by overwriting the pointer)<p>(None of these have &quot;nice&quot; answers. But thinking about them is IMHO extremely valuable since these are the most challenging in organizing and engineering an actual codebase itself.)
评论 #42370218 未加载
jll295 个月前
Thanks for sharing. Simplicity is a good thing.<p>In order to speed it up by reducing the number of malloc() calls, it may be worth adding a simple arena memory allocation measure: by allocating one larger block (e.g. 1 MB) initially and then doubling the memory size each time you run out, all malloc()&#x2F;calloc() calls can become local salmagundi_alloc() calls that are just macro invocations that return an arena buffer pointer and also increment said pointer as a side effect.<p>I also recommend you have a look at Chris Hanson&#x27;s book &quot;C: Interfaces and Implementations&quot;, which has a few neat C API tricks that your code could benefit from (e.g. for reducing name space pollution, for avoiding null pointer argument errors, API method naming etc.).
评论 #42362436 未加载
teo_zero5 个月前
Row 100 of hm_put() clears all contents of the item struct, thus losing the pointer to where the previous value was stored. But then row 107 needs this pointer to pass it to realloc().<p>Additionally, in case of overwrite, the memory previously allocated for the key is leaked.<p>All in all, I don&#x27;t think row 100 is really needed, and removing it wouldn&#x27;t do any harm.<p>Finally, deletes are hard in hash maps. Either you decide that deletes are not allowed, or you implement them with tombstones or re-insertions. A half-backed solution is not an option.<p>Hope these suggestions help you.
评论 #42366227 未加载
评论 #42364753 未加载
codr75 个月前
Considering there seems to be at least a one memory bug (hm_put&#x2F;key from another comment), I would strongly recommend running the tests in valgrind or similar if you haven&#x27;t already. Doing C without it usually ends in some kind of disaster for me.
评论 #42370666 未加载
vintagedave5 个月前
This is neat, and remarkably small. I personally need more comments in order to follow, say, the growth logic. I see it rehashes but I don&#x27;t see how it doesn&#x27;t potentially overwrite entries on the way.<p>Algorithm for hash collisions is just to find the next empty slot (linear probing)? What happens when the original entry is removed, are the ones in subsequent slots cycled backwards? I don&#x27;t see that...<p>Also the name is new to me! TIL &#x27;salmagundi&#x27; is an English word (not even a loan-word) for a type of salad made of many ingredients: an excellent name for a hash map that can contain many effectively random items.
评论 #42360267 未加载
评论 #42365133 未加载
cozzyd5 个月前
I&#x27;m curious about the #ifdef guard, which is of a form I&#x27;ve never encountered.<p><pre><code> #ifndef BD9DF82A4540BB19368E48E4747C0706 #define BD9DF82A4540BB19368E48E4747C0706 </code></pre> Does some IDE do that for you? Did you just pick a random sequence? Why not just a variant of #ifndef _salmagundi_h that is more common?<p>Other than that, I strongly echo other&#x27;s advice about avoiding so many mallocs.
评论 #42368368 未加载
评论 #42364118 未加载
评论 #42365372 未加载
tralarpa5 个月前
My C is quite rusty, so apologies for stupid questions.<p>In the hm_put function, when you overwrite, why do you malloc and copy the key again, and what happens with the old key pointer and the old value pointer? (no free for the key and the memset zeros everything?)
评论 #42361492 未加载
alanmoraes5 个月前
In the code example, I guess the variable `k_sz` is undefined in the call `hm_get(map, k, k_sz)`. Hope that helps.
ms95985 个月前
In reviewing hm_put, a few things stand out in the area of handling cases where memory allocation fails. The first problems are at lines 113-116 where memory is allocated and written to, then the allocations are validated in line 117 after they are written to with memcpy. An allocation failure would cause a crash. Secondly, more about semantics, is at line 105 where item-&gt;v is free’d after it is found to be NULL, to me it is counter intuitive and unnecessary. Reading further (and back to the first area of concern) at lines 118-119, after an allocation failure, both are free’d which, to me, is acceptable and convenient. In light of that, it seems that line 105 is there for the sake of the reviewer who may not like the apparent asymmetry of the two sites.
e-dant5 个月前
I want to thank everyone here who pointed out deficiencies in this little library.<p>Good catches :)
评论 #42362738 未加载
EricRiese5 个月前
I followed the link in the docs to the page on the hash function. There&#x27;s a DJB-2 algorithm on that page, but no DJB-1
评论 #42360230 未加载
zabzonk5 个月前
Purely out of interest, and probably my bad, but what is:<p><pre><code> (void)k_sz; </code></pre> doing? I&#x27;ve seen fussy people doing:<p><pre><code> (void) printf( &quot;hello&quot; ); </code></pre> but not what you have written. As I say, probably ignorance on my part.
评论 #42360655 未加载
评论 #42360614 未加载
评论 #42360600 未加载
评论 #42360642 未加载
评论 #42360617 未加载
david2ndaccount5 个月前
If you’re interested in this, I wrote a blogpost on a simple c hash table without deletion.<p><a href="https:&#x2F;&#x2F;www.davidpriver.com&#x2F;c-hash-table.html" rel="nofollow">https:&#x2F;&#x2F;www.davidpriver.com&#x2F;c-hash-table.html</a>
TheDudeMan5 个月前
I suggest improving the probing a little. Something like<p><pre><code> idx = (idx + 1) % map-&gt;cap; </code></pre> becomes<p><pre><code> iterCount++; idx = (idx + iterCount) % map-&gt;cap;</code></pre>
评论 #42362253 未加载
quotemstr5 个月前
This web site is amazing. We have, on one hand, people explaining machine sentience and quantum breakthroughs, and on the other hand, we have people refusing to use &quot;#pragma once&quot; in C because it&#x27;s still too novel and doesn&#x27;t work in every compiler yet.<p>Never change, HN.