We tried mimalloc in ClickHouse and it is two times slower than jemalloc in our common use case
<a href="https://github.com/microsoft/mimalloc/issues/11" rel="nofollow">https://github.com/microsoft/mimalloc/issues/11</a>
Are there functions available with which I can query at run time how much OS memory is used, how much is handed out in allocations, how many mmap()ed pools are in use, and so on?<p>I find that to be one of the most important features of a malloc library for debugging memory usage.<p>glibc has these functions (like malloc_info()) -- they are quite buggy in that they return wrong results, but after patching them to be correct, they are super useful.
Looks like the same idea as Konstantin Knizhnik's thread_alloc:<p><a href="http://www.garret.ru/threadalloc/readme.html" rel="nofollow">http://www.garret.ru/threadalloc/readme.html</a><p>At least the architecture of allocated-chunk management is the same.
I always find comparisons with tcmalloc hard to parse, since it has a million knobs and the defaults are terrible. If they are running with 16 threads, I would normally advise increasing the thread cache size far above the default 3MiB. Also interesting would be jemalloc in per-CPU mode.<p>As always, the thing to do is build and run your own workload and see the results.
The tricky part with allocators is always the multi-threaded setup.<p>Even something as simple as a bunch of threads doing malloc-free in a loop will drop the performance of a lot of allocators to the floor, due to some sort of central locking or excessive cache thrashing. This is typically solved by adding per-thread block pools, free lists, or some such.<p>If you go further down the rabbit hole, there's the case where blocks are allocated in one thread and freed in another -- your typical producer-consumer setup. This further complicates the pool/freelist setup and requires periodic rebalancing of the freelists and pools.<p>Once all this is accommodated, a well-tuned allocator inevitably converges to a model with central slabs/pools/freelists and per-thread caches of the same, which are periodically flushed into the former. Then it all comes down to routine code optimization to make fast paths fast, through lock-free data structures, clever tricks, and whatnot.<p>In other words, it's always nice to read through someone's allocator code, but in the end this is a very well-explored area, and there's basically a single stable point once all common scenarios are considered.
The benchmarks are very impressive! I am excited to read through this code and think on it.<p>Edit: They do mention they're all from AMD's EPYC chip, which is a little idiosyncratic. Speculation: perhaps page locality is more important on this architecture.
Just a general question in regards to using memory allocators, in the context of a C-only application.<p>The problems I encounter with allocators and heap managers are almost never solved by these types of frameworks. These problems include:<p>1. Improper usage of the returned memory that contradicts the implementation's assumptions.
2. Pool allocators that don't have separation between individual blocks (performance reasons).
3. Specifying the lifetime of the memory to a thread or until specific events happen.
4. Corruption that is difficult to diagnose with any available tool.<p>Here's a specific scenario I deal with very often:
There are N persistent worker threads. These worker threads have their own pool of memory, and prior to getting work we know this pool is clean. After the work is finished, and before more work is received, the memory is cleaned. Any excess requested memory is returned to the global pool, and any memory that is "unmanaged" is dealt with properly.<p>This means that people can use whatever heap-management call you provide (void * obtainMemory(size_t);) in the scope of business logic without having to worry about infrastructure concerns.<p>Having a faster malloc/calloc doesn't benefit me as much as making the usage of memory easier, and the understanding of what happens easier.
The important thing about all this is to measure perf on realistic workloads before and after. I don't really believe in allocators that have "excellent performance" on everything.
The devs at Discourse also tried it with Ruby; the results aren't as good as with jemalloc. [1]<p>[1] <a href="https://twitter.com/samsaffron/status/1143048590555697152" rel="nofollow">https://twitter.com/samsaffron/status/1143048590555697152</a>