I've heard of two possibilities (as a recommendation) for using data structures in C without having to develop them:<p>- BSD queue.h offers linked lists, queues, etc [1].<p>- UT-hash offers hash tables [2].<p>I've also read some people being frustrated with them, probably due to the quirky syntax/API. On the plus side (BSD at least) it's just a header file that you can download, include and start using.<p>[1] <a href="http://bxr.su/OpenBSD/sys/sys/queue.h" rel="nofollow">http://bxr.su/OpenBSD/sys/sys/queue.h</a><p>[2] <a href="https://troydhanson.github.io/uthash/" rel="nofollow">https://troydhanson.github.io/uthash/</a>
"The language doesn't come with [a hash table implementation] included"<p>The standard library does come with a half-baked hash table implementation[1] (hcreate/hdestroy/hsearch) that only allows creation of _one_ global hash table. GNU libc[2] adds hcreate_r/hdestroy_r/hsearch_r that allows multiple tables to be created.<p>The APIs are strange and antiquated. On the other hand, the hash table implementation described in the article presents a much nicer interface.<p>[1] <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/hcreate.html" rel="nofollow">http://pubs.opengroup.org/onlinepubs/009695399/functions/hcr...</a><p>[2] <a href="http://man7.org/linux/man-pages/man3/hsearch.3.html" rel="nofollow">http://man7.org/linux/man-pages/man3/hsearch.3.html</a>
The link at the bottom of the intro is broken: <a href="https://github.com/jamesroutley/write-a-hash-table/blob/v0.1.0/hash-table" rel="nofollow">https://github.com/jamesroutley/write-a-hash-table/blob/v0.1...</a> preventing the reader from advancing.<p>I think the whole thing would be best as a single page, to be honest. Even if it's long, people have scroll wheels. But it's a stylistic choice.
"(long)pow(a, len_s - (i+1))" is both inefficient and inaccurate. You should store the current power in a variable which you multiply by 'a' and reduce by 'm' on each iteration.<p>Also, your double hashing scheme has a bug. You identified that it's problematic if hash_b returns 0, but adding 1 to the result doesn't actually solve anything, because now you have the same problem if hash_b returns num_buckets-1.
Weird there isn't a src directory collecting everything together<p>Hashtables are one of those data structures one can easily get away with never having to write on their own, yet still use every day. In scripting languages it's impractical to roll your own, & in the more abstracted low level languages there'll usually be a pretty good generic version. Yet there's much room for variation<p>I had the joy to be out of reach of off-the-shelf hashtables for luwa,<p>init: <a href="https://github.com/serprex/luwa/blob/436c7271120fcd116af30e906919afe7a7e55254/rt/alloc.lua#L156" rel="nofollow">https://github.com/serprex/luwa/blob/436c7271120fcd116af30e9...</a><p>get/set: <a href="https://github.com/serprex/luwa/blob/436c7271120fcd116af30e906919afe7a7e55254/rt/table.lua" rel="nofollow">https://github.com/serprex/luwa/blob/436c7271120fcd116af30e9...</a><p>hash: <a href="https://github.com/serprex/luwa/blob/436c7271120fcd116af30e906919afe7a7e55254/rt/obj.lua#L130" rel="nofollow">https://github.com/serprex/luwa/blob/436c7271120fcd116af30e9...</a><p>Still need to implement Lua's sequential-portion-as-array logic. I do like the API not having a 'remove' method, opting intead for nil by default. nil value entries get vacuumed out when rehashing. This ends up acting as tombstone. In an earlier iteration I was using 0 as a marker distinct from nil, but an implementation detail is that nil is stored at address 0.. oops
If you just want to see hash tables work as a data structure, here's a talk-as-blog-post I made on the subject.<p><a href="http://nathanmlong.com/2015/10/reimplementing-rubys-hash/" rel="nofollow">http://nathanmlong.com/2015/10/reimplementing-rubys-hash/</a>
If your set is known at compile time, make a Perfect Hash Table with gperf instead. It's commonly used in compilers and other software with a parsing or interpreting stage.
In the intro section, on extending to unicode being non-trivial...is the argument that if your input has some non-ascii data you probably have all non-ascii data and the above 127 bucket will be too full? Honest question.
While the reasons for this that the author gave are valid, I think it would also be valuable to do this in a high level language. I'm sure lot of people who can read C code went through a CS program, and hopefully learned how hash tables work then. People without CS degrees and who mostly write JS/Ruby/Python/etc. and use hash tables on almost every line of code they write are much less likely to know how they work. Just saying, someone should do it for other languages.
Don't allocate these items individually; don't store pointers to such items. They're tiny, and the pointer chasing + allocator bookkeeping overhead + potential randomisation make it inefficient for no gain. Just allocate an array of items. This way you'll also deal with fewer potential errors. Speaking of errors, please handle them and fix the API so that the caller can know if there's a problem.
This might seem academic, but I actually had a real reason to hand-roll a hash table implementation that would work in process-shared memory (as an Apache module). It was based on the hash table implemented in mod_ldap, and was part of an LRU cache.
This is a good mostly simple overview. The appendix should also mention robin hood hashing as a method for addressing collisions. From my understanding it is cpu cache friendly and works well in practice.
Link should be changed to <a href="https://github.com/jamesroutley/write-a-hash-table" rel="nofollow">https://github.com/jamesroutley/write-a-hash-table</a>