There is a little-known Valgrind tool called "DHAT" (short for "Dynamic Heap Analysis Tool") that's designed to help find exactly these sorts of excessive allocations.<p>Here's an old blog post describing it, by DHAT's author: <a href="https://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with-dhat/" rel="nofollow">https://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with...</a><p>Here's another blog post in which I describe how I used it to speed up the Rust compiler significantly:
<a href="https://blog.mozilla.org/nnethercote/2016/10/14/how-to-speed-up-the-rust-compiler/" rel="nofollow">https://blog.mozilla.org/nnethercote/2016/10/14/how-to-speed...</a><p>And here is the user manual:
<a href="http://valgrind.org/docs/manual/dh-manual.html" rel="nofollow">http://valgrind.org/docs/manual/dh-manual.html</a>
For the benefit of others who found the description in the blog post unclear and can't or don't want to dig through the code changes themselves: "fixing the hash code and the linked list code to not use mallocs" is a bit misleading. Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload. So it's one malloc instead of two per dynamically allocated list element. This explains the "down to 80 allocations from the 115" part.<p>The larger gain is explained better and comes simply from stack allocation of some structures (which live in a simple array, not a linked list or hash table).
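The idiom described above can be sketched roughly like this (illustrative names, not curl's actual identifiers): the prev/next pointers live inside the payload struct, so each list element costs one malloc instead of two.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
  struct node *prev;
  struct node *next;
};

struct job {
  struct node link;  /* inlined list node: kept as the first member so a
                        node pointer can be cast back to the payload */
  int id;
};

/* Append 'n' to a circular list with sentinel head 'head'. */
static void list_append(struct node *head, struct node *n) {
  n->prev = head->prev;
  n->next = head;
  head->prev->next = n;
  head->prev = n;
}

static struct job *job_new(int id) {
  struct job *j = malloc(sizeof(*j));  /* one allocation for node + payload */
  if (j)
    j->id = id;
  return j;
}
```

Before this change you would have one malloc for the payload and a second for the list node pointing at it; inlining the node removes the second one entirely.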
I think this is fantastic engineering work on performance, rather than falling back on the "RAM is cheap" line and doing nothing.<p>It's not every day that you see someone examine and improve old code in a way that yields a measurable benefit to both direct and indirect users.
The underlying problem is that C doesn't have comprehensive standard collections, so many developers reinvent the wheel over and over again, and usually that wheel is far from the best available. If curl were written in C++, those optimizations would be applied automatically just by using the STL collections.
My problem with excessive allocations is usually what happens in interpreted languages. People think: hey, it's slow-ass interpreted anyway, so let's not care about allocation at all.<p>An example I see all the time is in the many Python libraries that ultimately do I/O against a TCP socket. Sometimes what the user passes to the library can be kept as an array of buffers all the way until it goes out to the socket.<p>Instead of iterating over the array and sending each block (if big enough) on its own to the socket, the library author concatenates them into one buffer and then sends that over the socket.<p>When dealing with big data, this adds a lot of fragmentation and measurable overhead, yet some library authors don't care...<p>Even the standard httplib and requests have this issue when sending a large file via POST (they concatenate it with the header, instead of sending the header and then the large file).
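A C analogue of the point above: instead of concatenating a header and a large body into one freshly allocated buffer, hand both pieces to the kernel as-is with scatter-gather I/O. `writev()` is the real POSIX call; `send_pieces()` is a hypothetical helper for illustration, and a production sender would also have to loop on short writes.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t send_pieces(int fd, const void *hdr, size_t hlen,
                           const void *body, size_t blen) {
  struct iovec iov[2];
  iov[0].iov_base = (void *)hdr;   /* header buffer, sent first */
  iov[0].iov_len = hlen;
  iov[1].iov_base = (void *)body;  /* large body, never copied into a
                                      joined buffer */
  iov[1].iov_len = blen;
  return writev(fd, iov, 2);       /* kernel gathers both pieces */
}
```

Python exposes the same idea through `os.writev` and `socket.sendmsg`, so the concat-then-send pattern is avoidable there too.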
Explicit dynamic memory handling in low-level languages hurts in a way similar to how garbage collectors hurt in high-level languages: hidden and often unpredictable execution costs (malloc/realloc/free internally run non-trivial bookkeeping algorithms whose cost is hard to predict). So the key to performance, whether you work in a low-level or a high-level language, is to use preallocated data structures when possible. That way you get low fragmentation and fast execution, because you avoid calling the allocator/deallocator in the explicit case, and you put less pressure on the garbage collector, for the same reasons, in the garbage-collected case.
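A minimal sketch of the preallocation advice: a fixed pool of objects threaded onto a free list. Acquire and release are both O(1) pointer swaps, with no allocator call and no fragmentation. Pool size and names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 64

struct obj {
  struct obj *next_free;  /* intrusive free-list link */
  int value;
};

static struct obj pool[POOL_SIZE];  /* preallocated up front */
static struct obj *free_list;

static void pool_init(void) {
  for (int i = 0; i < POOL_SIZE - 1; i++)
    pool[i].next_free = &pool[i + 1];
  pool[POOL_SIZE - 1].next_free = NULL;
  free_list = &pool[0];
}

static struct obj *pool_get(void) {  /* O(1), no malloc */
  struct obj *o = free_list;
  if (o)
    free_list = o->next_free;
  return o;                          /* NULL when the pool is exhausted */
}

static void pool_put(struct obj *o) {  /* O(1), no free */
  o->next_free = free_list;
  free_list = o;
}
```

The trade-off is a fixed capacity, but for hot paths with a known upper bound that is usually a price worth paying.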
My rule of thumb is to look at the application's design and only use malloc where there is a 1:N (or N:M) relationship between entities. Everything that's 1:1 should be allocated in a single step.
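One common way to apply that rule in C: when a header and its variable-length payload are strictly 1:1, use a flexible array member so both come from a single malloc. The names here are illustrative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct message {
  size_t len;
  char data[];  /* flexible array member: payload lives in the same
                   block as the header */
};

static struct message *message_new(const char *src, size_t len) {
  /* one malloc covers header + payload (+1 for a terminating NUL) */
  struct message *m = malloc(sizeof(*m) + len + 1);
  if (m) {
    m->len = len;
    memcpy(m->data, src, len);
    m->data[len] = '\0';
  }
  return m;
}
```

One allocation also means one free, which simplifies cleanup paths as a bonus.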
> Doing very small (less than say 32 bytes) allocations is also wasteful just due to the very large amount of data in proportion that will be used just to keep track of that tiny little memory area (within the malloc system). Not to mention fragmentation of the heap.<p>That's not necessarily true. Modern allocators tend to serve small requests from fixed-size buckets (size classes), which keeps the per-allocation bookkeeping overhead small.<p>But given that curl runs on lots of platforms, it makes sense to just fix the code.
Note that this pattern[0] is essentially "copy-on-write", which can be encapsulated safely as such in a reasonably simple type (in a language with generics) and used elsewhere. I use a similar mechanism pervasively in some low-level web server code to use references into the query string, body and JSON objects directly when possible, and allocated strings when not.<p>[0] <a href="https://github.com/curl/curl/commit/5f1163517e1597339d" rel="nofollow">https://github.com/curl/curl/commit/5f1163517e1597339d</a>
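The borrowed-or-owned pattern the comment describes can be sketched in C as well, without generics: keep a pointer into the caller's buffer when possible, and only allocate (and remember to free) when a private copy is needed. This is an illustrative sketch, not curl's actual type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct str_ref {
  const char *ptr;
  size_t len;
  bool owned;  /* true if ptr was malloc'd and must be freed */
};

/* Zero-copy: just reference the caller's buffer. */
static struct str_ref str_borrow(const char *s, size_t len) {
  struct str_ref r = { s, len, false };
  return r;
}

/* Copy on first need: upgrade a borrowed reference to an owned copy. */
static int str_make_owned(struct str_ref *r) {
  if (r->owned)
    return 0;
  char *copy = malloc(r->len + 1);
  if (!copy)
    return -1;
  memcpy(copy, r->ptr, r->len);
  copy[r->len] = '\0';
  r->ptr = copy;
  r->owned = true;
  return 0;
}

static void str_release(struct str_ref *r) {
  if (r->owned)
    free((void *)r->ptr);
}
```

The fast path (parsing a query string or header in place) never touches the allocator; only values that must outlive or differ from the source buffer pay for a copy.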
> The point is rather that curl now uses less CPU per byte transferred, which leaves more CPU over to the rest of the system to perform whatever it needs to do. Or to save battery if the device is a portable one.<p>Does anyone have a general sense of how these kinds of efficiencies translate to real-world battery life? I understand that the mechanisms (downclocking/sleeping the CPU) are there; I'm just curious as to how much it actually moves the needle in a real system.
> There have been 213 commits in the curl git repo from 7.53.1 till today. There’s a chance one or more other commits than just the pure alloc changes have made a performance impact, even if I can’t think of any.<p>"I can't think of any" is not a very scientific way to measure optimizations. In fact, this alone casts doubt on whether the malloc changes caused the speed-up, as opposed to any of the 200+ other commits the author is working on top of.<p>Why not eliminate that doubt by applying only the malloc optimizations to the previous official release?
I'm a bit skeptical about the speed up myself, since I would expect curl to be primarily IO bound and not CPU bound (much less malloc bound, given how little memory it uses).
> This time the generic linked list functions got converted to become malloc-less (the way linked list functions should behave, really).<p>I don't see how a linked list can not use malloc().
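One answer to the question above: the list code itself never allocates. The caller embeds the node in its own structure, which can live on the stack, in an array, or inside some other allocation, and the list functions only rewire pointers. This is an illustrative sketch, not curl's actual API.

```c
#include <assert.h>
#include <stddef.h>

struct list_node {
  struct list_node *next;
};

struct list {
  struct list_node *head;
};

static void list_push(struct list *l, struct list_node *n) {
  n->next = l->head;  /* no malloc anywhere in the list code */
  l->head = n;
}

struct item {
  struct list_node node;  /* embedded node, kept as the first member so a
                             node pointer can be cast back to the item */
  int value;
};
```

Whoever owns the `struct item` owns the node, so the list's lifetime is tied to the items themselves, not to the allocator.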