And if you only need a monotone priority queue (i.e. a priority queue where the popped elements are monotonically increasing/decreasing), you should consider using a radix heap. This monotone requirement is satisfied more often than you might expect, e.g. when using time as the key or in pathfinding. I have a simple radix heap implementation here: <a href="https://github.com/mpdn/radix-heap" rel="nofollow">https://github.com/mpdn/radix-heap</a>
Like folks mentioned there, I wonder if a higher-fanout heap (people asked about 4-ary) might also do well in practice. Looking at Wikipedia (<a href="https://en.wikipedia.org/wiki/D-ary_heap" rel="nofollow">https://en.wikipedia.org/wiki/D-ary_heap</a>), it looks like maybe so -- the growth in the number of comparisons isn't that bad, and it specifically mentions 4-ary heaps working well in practice.<p>(Like how linear search wins for small N, or insertion sort helps with small subarrays in introsort: more steps, but each step is a lot cheaper on modern hardware thanks to fewer branch mispredictions, better cache locality, or something else that better fits the hardware's strengths.)
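For concreteness, the only things that change relative to a binary heap are the index arithmetic and the per-level child scan. A minimal 4-ary min-heap sketch (my own illustration, not from the article):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Minimal 4-ary min-heap. The tree has log4(n) levels instead of
// log2(n), so a pop touches half as many levels (fewer cache-missing
// hops down the array) at the cost of up to 3 extra comparisons per
// level -- comparisons that are cheap and branch-predictor friendly.
class QuadHeap {
    std::vector<int> a;
public:
    bool empty() const { return a.empty(); }

    void push(int v) {
        a.push_back(v);
        size_t i = a.size() - 1;
        while (i > 0) {                    // sift up
            size_t p = (i - 1) / 4;        // parent index for fanout 4
            if (a[p] <= a[i]) break;
            std::swap(a[p], a[i]);
            i = p;
        }
    }

    int pop() {
        assert(!a.empty());
        int top = a[0];
        a[0] = a.back();
        a.pop_back();
        size_t i = 0;
        for (;;) {                         // sift down
            size_t first = 4 * i + 1;      // first of up to 4 children
            if (first >= a.size()) break;
            size_t end = std::min(first + 4, a.size());
            size_t best = first;           // scan children for the minimum
            for (size_t c = first + 1; c < end; ++c)
                if (a[c] < a[best]) best = c;
            if (a[i] <= a[best]) break;
            std::swap(a[i], a[best]);
            i = best;
        }
        return top;
    }
};
```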
I have used a min-max heap once. I don't remember why I needed it at the time--it was a previous job--but I do remember that I had to roll my own, because it's just not that popular of a data structure, and it was the obvious and only good solution to the problem at the time.<p>So, it's nice to see a detailed analysis of the structure like this! Perhaps if it becomes more popular, I will find more places to use it.
> "<i>...C++20 finally added a way of accessing this instruction, using std::countl_zero. I don’t have C++20 though so I had to do it the old platform specific ways ...</i>"<p>You don't need C++20. Even C++98 had std::bitset<>::count(), which has a nice implicit conversion from unsigned long, and which, when compiled for a target with the POPCNT instruction (SSE4.2-era hardware, i.e. Nehalem or later), compiles down to a single POPCNT. It is pretty simple to derive various other results, including countl_zero, from a popcount with just a few additional bitwise ops.<p>Modern compilers are happy to keep a bitset<32> in the same register as an unsigned long, so the conversion takes exactly zero cycles. POPCNT has a three-cycle latency, and the extra bitwise ops cost another couple.
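One standard way to get countl_zero out of a popcount is to smear the highest set bit rightward first, so the popcount of the smeared value is the number of significant bits (the function name here is mine; only C++98 features are used):

```cpp
#include <bitset>
#include <cassert>

// countl_zero for 32-bit values via std::bitset popcount (C++98-safe).
// After the smear, every bit at or below the highest set bit is 1, so
// popcount(smeared) == 32 - countl_zero(x).
inline unsigned countl_zero32(unsigned long x) {
    x &= 0xFFFFFFFFul;   // keep only the low 32 bits
    x |= x >> 1;         // smear the highest set bit into all
    x |= x >> 2;         // lower positions
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return 32 - (unsigned)std::bitset<32>(x).count();
}
```

For example, countl_zero32(1) is 31 and countl_zero32(0) is 32, matching std::countl_zero on a uint32_t.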
In certain domains, the trend has been to give up constant factors in order to increase programmer productivity (e.g., Python pays a roughly 10x slowdown but is batteries included).<p>So in that case I would use this data structure even if it weren't faster. I can't count the number of times I have had to mess with inserting negated priorities into a min-heap to create a max-heap! We should just have one data structure that does both.<p>(Though taking this idea to its logical extreme means we should just use an order statistic tree for everything, since it not only gives you O(log n) min/max, but also O(log n) find-kth and rank-of-x.)
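A small illustration of the negation trick next to the comparator-based alternative, in C++ terms (the helper names are mine):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// The negation hack: push -priority into a min-heap, negate on the
// way out. It works, but it's easy to forget a negation somewhere,
// and it silently breaks on INT_MIN.
inline int max_via_negation(const std::vector<int>& xs) {
    std::priority_queue<int, std::vector<int>, std::greater<int> > minh;
    for (size_t i = 0; i < xs.size(); ++i) minh.push(-xs[i]);
    return -minh.top();   // undo the negation to recover the max
}

// Saying what you mean instead: parameterize the comparator
// (std::priority_queue is a max-heap by default).
inline int max_via_comparator(const std::vector<int>& xs) {
    std::priority_queue<int> maxh;
    for (size_t i = 0; i < xs.size(); ++i) maxh.push(xs[i]);
    return maxh.top();
}
```

A min-max heap sidesteps both: one structure answers min and max queries directly, with no negation and no duplicated heaps.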
If you need a min-max heap (or double ended heap) in Go here is one I've used: <a href="https://github.com/aalpar/deheap" rel="nofollow">https://github.com/aalpar/deheap</a><p>Very useful when you need it!
It would be neat to fire this up on an older processor that doesn’t have modern instruction-level parallelism and verify the difference in performance
I've been using the author's flat_hash_map (<a href="https://github.com/skarupke/flat_hash_map" rel="nofollow">https://github.com/skarupke/flat_hash_map</a>) for several years now and have been really impressed. I've yet to find a single bug; it's nearly as fast as folly's hash maps (for my use case, anyway) but far easier to integrate than folly.