Do note that "custom allocator" can often just mean "take what malloc gives you and carve it up yourself" rather than "interpose malloc"; there's usually no need to drop the system allocator completely. This lets you experiment with your own pools or arenas without going all in and building something general-purpose that works for every object in your program.

> Over the years, I’ve wasted many hours debugging memory issues. With just a hash table and a set of linked lists, you can track the history of all your allocations. That makes it pretty easy to track down use-after-free errors, double-free errors, and memory leaks. Using guard pages around your allocations, you can detect overflows.

No, just use AddressSanitizer: it's meant for this. Odds are your custom allocator has a few bugs of its own, and they will make your life more annoying if you're using it only for this.
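For illustration, here's a minimal sketch of that "carve up what malloc gives you" approach: a bump arena built on one malloc'd block. All names and sizes are made up for the example.

    #include <cstdlib>
    #include <cstddef>

    // A bump arena carved out of a single malloc'd block.
    // Individual frees are unsupported; the whole arena is freed at once.
    struct Arena {
        char*  base;
        size_t used;
        size_t cap;
    };

    Arena arena_create(size_t cap) {
        return Arena{static_cast<char*>(std::malloc(cap)), 0, cap};
    }

    void* arena_alloc(Arena& a, size_t size, size_t align) {
        size_t p = (a.used + align - 1) & ~(align - 1);  // round up to alignment
        if (p + size > a.cap) return nullptr;            // out of space
        a.used = p + size;
        return a.base + p;
    }

    void arena_destroy(Arena& a) {
        std::free(a.base);  // one free() releases everything at once
    }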
One of these things is not like the others.

Slabs and bump/arena allocators have significant advantages over just using the system allocator. You use them in a constrained way which allows for a more efficient implementation—slabs by having only one size of allocation, and bump/arena by freeing everything at once (or maybe in LIFO order if you're fancy). Thus, they make sense in a lot of performance-oriented programs.

A buddy allocator has no such constraints. It's general-purpose. Thus, there's no obvious opportunity to do better than the system allocator. And it's pretty wasteful if your allocation sizes aren't close to powers of 2 (internal fragmentation => wasted RAM and cache). If you have a system allocator available that uses, say, size classes instead, I have no idea why you'd prefer a buddy allocator. For that matter, if your system allocator is a buddy allocator (or more likely: some kind of hybrid), I have no idea why you'd prefer your own implementation.

One constraint might be that you promise your buddy allocator's usage is single-threaded, and thus no locking is necessary. But a production-quality system allocator likely uses thread-local or CPU-local caches to accomplish much the same thing, so I don't think this constraint buys you much efficiency.

If you're implementing a bare-metal system on a microcontroller and need a very small memory allocator implementation for light usage, implementing your own buddy allocator might make sense. But that's a lot of caveats...
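To make the internal-fragmentation point concrete, here's a small sketch of the two core buddy computations: rounding a request up to the next power of two, and finding a block's buddy by flipping one bit. The numbers in the comments are just an example.

    #include <cstddef>

    // Round a request up to the next power of two (the buddy block size).
    size_t buddy_round_up(size_t n) {
        size_t s = 1;
        while (s < n) s <<= 1;
        return s;
    }

    // The buddy of a block at `offset` with size 2^k is found by flipping bit k.
    size_t buddy_of(size_t offset, unsigned k) {
        return offset ^ (size_t{1} << k);
    }

    // e.g. buddy_round_up(65) == 128: a 65-byte request wastes nearly half the block.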
Hmm, a few more allocators to mention.

0. Garbage collection: reference counting -- C++, Python, Rust. Reference counting augments any of the allocators discussed in the blogpost with a simple garbage-collection scheme.

1. Garbage collection: stop-and-copy -- This is the easiest garbage collection to implement. It's closely related to the linear allocator, except "free" compiles into a nop; only "new" matters. When "new" runs out of space (which it inevitably will), you scan the entire memory space and copy all currently live objects to a 2nd, identically sized heap.

Stop-and-copy uses up 50% of the space (if you have 32GB available, you can only have 2x16GB heaps, and can only use up to 16GB total). Still, the gross simplicity of stop-and-copy is cool.

2. Garbage collection: mark-and-sweep -- This is a bit complicated in my experience. It works well in practice, but it's harder to program (especially if you want to do it efficiently, with tri-color invariants and all that).

--------

Because of the gross simplicity of alloc() in stop-and-copy, it can be very fast in some use cases! If you can't figure out how to make free() work with linear allocators, just stop-and-copy the darn thing. (A minimal sketch below.)

--------

Garbage collection is reserved for when your links start to point to each other in ways far more complicated than linked lists.

If you have cycles in your linked lists, or your linked lists act like cons pairs (aka car/cdr) from Lisp, garbage collection is basically inevitable. There's no real way to know when it's safe to delete a pointer if you're using Lisp-style car/cdr pairs. Even reference counting is bad, because cycles are pretty common in that style of programming.
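Here's a minimal, hedged sketch of the stop-and-copy idea: two semispaces, bump allocation, free() as a no-op, and a flip that evacuates live objects. Discovering live objects by tracing roots and rewriting every pointer to them is the hard part, and it's left to the caller here; every name is invented for the example.

    #include <cstdlib>
    #include <cstring>
    #include <cstddef>

    // Two equal semispaces; we bump-allocate in `from` and evacuate into `to`.
    struct Gc {
        char *from, *to;
        size_t used, cap;
    };

    Gc gc_create(size_t cap) {
        return Gc{static_cast<char*>(std::malloc(cap)),
                  static_cast<char*>(std::malloc(cap)), 0, cap};
    }

    // free() is a no-op; alloc is just a bump. Returns null when a flip is needed.
    void* gc_alloc(Gc& g, size_t n) {
        if (g.used + n > g.cap) return nullptr;
        void* p = g.from + g.used;
        g.used += n;
        return p;
    }

    // Copy one live object into to-space, returning its new address.
    // A real collector finds live objects by tracing roots and also
    // rewrites every pointer to the moved object; that's the hard part.
    void* gc_evacuate(Gc& g, size_t& to_used, const void* obj, size_t n) {
        void* dst = g.to + to_used;
        std::memcpy(dst, obj, n);
        to_used += n;
        return dst;
    }

    // The "flip": after evacuating all live objects, swap the semispaces.
    void gc_flip(Gc& g, size_t to_used) {
        char* t = g.from; g.from = g.to; g.to = t;
        g.used = to_used;
    }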
I've gotten a lot of mileage out of the following two extensions to the linear allocator:

1. A "pop-to" operation. If you're ready to free ALL memory allocated at or after allocation X, you can do this by moving the start-of-free-space pointer back to the start of allocation X.

2. An end-of-free-space pointer, with its own push and pop-to operations.

In combination, these are often enough to get maximum value out of a fixed-size memory workspace. A function can allocate all returned data structures on one side of the workspace and use the other side for all temporary allocations. This approach does require tight coupling, so there are a lot of settings where it clearly doesn't belong, but I've found it to be very effective for high-performance scientific code.
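A minimal sketch of that double-ended workspace, under the assumption that callers save and restore marks themselves; all names are illustrative.

    #include <cstddef>

    // Fixed-size workspace allocated from both ends:
    // results grow from the bottom, temporaries from the top.
    struct Workspace {
        char*  buf;
        size_t lo;   // start-of-free-space pointer (grows up)
        size_t hi;   // end-of-free-space pointer (grows down)
    };

    void* push_lo(Workspace& w, size_t n) {          // allocate from the bottom
        if (w.lo + n > w.hi) return nullptr;
        void* p = w.buf + w.lo;
        w.lo += n;
        return p;
    }

    void* push_hi(Workspace& w, size_t n) {          // allocate from the top
        if (w.hi - w.lo < n) return nullptr;
        w.hi -= n;
        return w.buf + w.hi;
    }

    // "pop-to": free everything allocated at or after a saved mark.
    void pop_lo_to(Workspace& w, size_t mark) { w.lo = mark; }
    void pop_hi_to(Workspace& w, size_t mark) { w.hi = mark; }

    // Typical use: size_t m = w.hi; ...temporaries via push_hi...; pop_hi_to(w, m);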
Would recommend watching the "What's a Memory Allocator, Anyway?" talk from Zig contributor Benjamin Feng:

https://www.youtube.com/watch?v=vHWiDx_l4V0
From TFA:

> The trick is to use the allocations themselves as linked list nodes so you don’t have to waste any extra memory for tracking.

There was an article that I'm struggling to find—posted to HN, IIRC—recommending against this approach. IIRC there were two reasons stated:

* Efficiency, due to CPU cache. Freed allocations might have been unused for a long time and thus be out of cache. Writing the pointer to the linked list on free puts it back into cache—a whole cache line when you only need 8 bytes or whatever for the singly-linked list—possibly at the expense of something hotter. (I think there are special instructions on many platforms to store without adding to cache, but that's probably not wise either, because a bunch of frees might come together and thus need to access a previous one's book-keeping info. Better to keep all the book-keeping info together in a dense region so you use more of the cache line.)

* Debugging trouble with buggy applications. They're more likely to overwrite adjacent allocations and thus the memory allocator's book-keeping, resulting in hard-to-debug failures. (I recall not being super convinced of this reason—I think external tracking alone is insufficient for making this kind of error easy to debug. I think you'd want to make it possible for ASan to work well, thus you'd want to implement its "manual poisoning" interface.)

edit: I found this tcmalloc issue talking about not wanting to use singly-linked lists because of cache effects when moving stuff from the thread cache to the central list. Kind of similar. https://github.com/gperftools/gperftools/issues/951
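For reference, a minimal sketch of the intrusive free-list trick TFA describes—storing the next pointer inside the freed block itself—so it's clear what the cache argument is about. Assumes blocks are at least pointer-sized; names are invented.

    // Intrusive free list: each freed block's first bytes hold the next pointer,
    // so tracking costs no extra memory but touches (and caches) the cold block.
    struct FreeNode { FreeNode* next; };

    static FreeNode* free_list = nullptr;

    void pool_free(void* block) {
        FreeNode* n = static_cast<FreeNode*>(block);
        n->next = free_list;   // this write pulls the block's cache line back in
        free_list = n;
    }

    void* pool_alloc() {
        if (!free_list) return nullptr;  // caller falls back to fresh memory
        FreeNode* n = free_list;
        free_list = n->next;
        return n;
    }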
One more allocator to mention:

* Fixed-size allocator

If you don't need multiple sizes (like malloc(size)), then a fixed-size allocator is significantly simpler to implement, to the point of absurdity.

Step 1: Put all valid memory locations into a stack.

Step 2: alloc with stack.pop();

Step 3: free with stack.push();

The end. You get memory locality, cache-friendliness, O(1) operations, zero external fragmentation. The works. (A sketch follows below.) Bonus points: stack.push() and stack.pop() can be implemented with SIMD with the help of stream compaction (http://www.cse.chalmers.se/~uffe/streamcompaction.pdf), and can therefore serve as a GPU malloc as long as one size is sufficient.

--------------

The only downside of "fixed size" allocation is the significant amount of internal fragmentation per node. (Ex: if you only use 8 bytes per node, your 64-byte reservation is mostly wasted.) But if you're making a custom allocator, fixed-size is probably one of the easiest to implement in practice.

-------------

You can extend this to multithreaded operation by simply using atomic-swap, atomic-and, and atomic-or across a shared bitset between threads: 1 bit per fixed-size block (bit == 1 means something has been alloc()ed; bit == 0 means it's free()d). The stack.push() and stack.pop() operations can run thread-local.

I recommend 64 bytes or 128 bytes as your "fixed size", because anything smaller than a 64-byte cache line means false sharing becomes a thing.

---------

If you have a fixed-size allocator but need items to extend beyond a fixed-size array, then learn to use linked lists.
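A minimal single-threaded sketch of steps 1-3, with block size and count chosen arbitrarily by the caller:

    #include <cstdlib>
    #include <cstddef>

    // Fixed-size pool: a stack of free slots over one contiguous buffer.
    struct Pool {
        char*  blocks;     // count * block_size bytes of storage
        void** free_stack; // stack of free slot pointers
        size_t top;        // number of free slots currently on the stack
    };

    Pool pool_create(size_t block_size, size_t count) {
        Pool p;
        p.blocks = static_cast<char*>(std::malloc(block_size * count));
        p.free_stack = static_cast<void**>(std::malloc(count * sizeof(void*)));
        p.top = count;
        for (size_t i = 0; i < count; i++)          // Step 1: push every slot
            p.free_stack[i] = p.blocks + i * block_size;
        return p;
    }

    void* pool_alloc(Pool& p) {                     // Step 2: alloc = pop
        return p.top ? p.free_stack[--p.top] : nullptr;
    }

    void pool_free(Pool& p, void* block) {          // Step 3: free = push
        p.free_stack[p.top++] = block;
    }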
For C++, I walked (halfway) the really long road of replacing std:: with std::pmr:: where I could, safely plugging in jemalloc, and solved the performance issues we had with the standard heap allocator (Windows). But std::pmr::string now being 8 bytes bigger caused unexpected slowdowns. Also tons of changes, different types; overall not great.

Then I discovered mimalloc, which hooks in at low-level malloc/free, and at the same time a coworker did a similar job, and now I'm no longer in love with "pmr".

I wish this were a success story, but it's not. It probably could be, if everyone were on board with it. But that was hardly my case. Libraries aren't on board either: Qt is not, and many others.

pmr is great for things like: try to allocate on the stack, and if that's not enough, continue with the heap (under the hood). But the extra 8 bytes, and the fact that do_allocate is a virtual call, are problems (the latter maybe not a measurable perf hit, but still hard to sell to performance-oriented programmers).

I wish it had been designed in a way where it could've been sold as such.
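For context, a small sketch of that pmr pattern: a stack buffer that spills to the heap once exhausted, via std::pmr::monotonic_buffer_resource. The buffer size is arbitrary for the example.

    #include <memory_resource>
    #include <string>
    #include <vector>

    int main() {
        // Allocate from this stack buffer first; fall back to the default
        // (heap) resource automatically once the buffer is exhausted.
        char buf[1024];
        std::pmr::monotonic_buffer_resource pool(
            buf, sizeof(buf), std::pmr::new_delete_resource());

        std::pmr::vector<std::pmr::string> names(&pool);
        names.emplace_back("lives in the stack buffer, spills to heap if too big");
        // Note: each pmr container carries a pointer to its memory_resource,
        // which is why std::pmr::string is 8 bytes bigger than std::string.
    }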
I know custom allocators are important for tiny platforms and so on, but what's a reason C/C++ can't allocate memory in a satisfactory way by default on those platforms? And how do other languages get away with it?
Andrei Alexandrescu's 2015 talk about C++'s allocators is great. He's a very entertaining presenter, and he makes a case for extremely composable templated allocators (and why std::allocator isn't that).

https://www.youtube.com/watch?v=LIb3L4vKZ7U
If people are interested in this, I'd recommend Silvio Cesare's new hackerspace blog. He's been publishing what appears to be a running series for a year and change now on heap attacks; highly illuminating. Several of the posts are attacks on embedded allocators and on allocators other than the main one in glibc.

https://blog.infosectcbr.com.au/
Tangential, but I just learned that C is platform-independent in the following way, compared to Java:

- C is "write once, compile everywhere, run everywhere"

- Java is "write once, compile once, run everywhere"