There is a little-known Valgrind tool called "DHAT" (short for "Dynamic Heap Analysis Tool") that's designed to help find exactly these sorts of excessive allocations.<p>Here's an old blog post describing it, by DHAT's author: <a href="https://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with-dhat/" rel="nofollow">https://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with...</a><p>Here's another blog post in which I describe how I used it to speed up the Rust compiler significantly:
<a href="https://blog.mozilla.org/nnethercote/2016/10/14/how-to-speed-up-the-rust-compiler/" rel="nofollow">https://blog.mozilla.org/nnethercote/2016/10/14/how-to-speed...</a><p>And here is the user manual:
<a href="http://valgrind.org/docs/manual/dh-manual.html" rel="nofollow">http://valgrind.org/docs/manual/dh-manual.html</a>
For the benefit of others who found the description in the blog post unclear and can't or don't want to dig through the code changes themselves: "fixing the hash code and the linked list code to not use mallocs" is a bit misleading. Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload. So it's one malloc instead of two per dynamically allocated list element. This explains the "down to 80 allocations from the 115" part.<p>The larger gain is explained better and comes simply from stack allocation of some structures (which live in a simple array, not a linked list or hash table).
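The idiom described above can be sketched roughly like this (illustrative names, not curl's actual identifiers): the prev/next pointers live inside the payload struct, so each list element costs one malloc instead of two.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
  struct node *prev;
  struct node *next;
};

struct job {
  struct node link;  /* inlined list node: kept as the first member so a
                        node pointer can be cast back to the payload */
  int id;
};

/* Append 'n' to a circular list with sentinel head 'head'. */
static void list_append(struct node *head, struct node *n) {
  n->prev = head->prev;
  n->next = head;
  head->prev->next = n;
  head->prev = n;
}

static struct job *job_new(int id) {
  struct job *j = malloc(sizeof(*j));  /* one allocation for node + payload */
  if (j)
    j->id = id;
  return j;
}
```

Before this change you would have one malloc for the payload and a second for the list node pointing at it; inlining the node removes the second one entirely.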
I think this is fantastic engineering work on performance, rather than falling back on the "RAM is cheap" line and doing nothing.<p>It's not every day that you see someone examine and improve old code in a way that yields a measurable benefit to both direct and indirect users.
The underlying problem is that C doesn't have comprehensive standard collections, so many developers reinvent the wheel over and over again, and usually that wheel is far from the best available. If curl were written in C++, those optimizations would be applied automatically just by using the STL collections.
My problem with excessive allocations is usually what happens in interpreted languages. People think: hey, it's slow-ass interpreted anyway, so let's not care about allocation at all.<p>An example I see all the time is in the many Python libraries that ultimately do I/O against a TCP socket. Sometimes what the user passes to the library can be kept as an array of buffers all the way until it goes out to the socket.<p>Instead of iterating over the array and sending each block (if big enough) on its own to the socket, the library author concatenates them into one buffer and then sends that over the socket.<p>When dealing with big data, this adds a lot of fragmentation and measurable overhead, yet some library authors don't care...<p>Even the standard httplib and requests have this issue when sending a large file via POST (they concatenate it with the header, instead of sending the header and then the large file).
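A C analogue of the point above: instead of concatenating a header and a large body into one freshly allocated buffer, hand both pieces to the kernel as-is with scatter-gather I/O. `writev()` is the real POSIX call; `send_pieces()` is a hypothetical helper for illustration, and a production sender would also have to loop on short writes.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t send_pieces(int fd, const void *hdr, size_t hlen,
                           const void *body, size_t blen) {
  struct iovec iov[2];
  iov[0].iov_base = (void *)hdr;   /* header buffer, sent first */
  iov[0].iov_len = hlen;
  iov[1].iov_base = (void *)body;  /* large body, never copied into a
                                      joined buffer */
  iov[1].iov_len = blen;
  return writev(fd, iov, 2);       /* kernel gathers both pieces */
}
```

Python exposes the same idea through `os.writev` and `socket.sendmsg`, so the concat-then-send pattern is avoidable there too.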
Explicit dynamic memory handling in low-level languages hurts in a way similar to how garbage collectors hurt in high-level languages: hidden and often unpredictable execution costs (malloc/realloc/free internally run non-trivial bookkeeping algorithms whose cost is hard to predict). So the key to performance, whether you work in a low-level or a high-level language, is to use preallocated data structures when possible. That way you get low fragmentation and fast execution, because you avoid calling the allocator/deallocator in the explicit case, and you put less pressure on the garbage collector, for the same reasons, in the garbage-collected case.
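A minimal sketch of the preallocation advice: a fixed pool of objects threaded onto a free list. Acquire and release are both O(1) pointer swaps, with no allocator call and no fragmentation. Pool size and names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 64

struct obj {
  struct obj *next_free;  /* intrusive free-list link */
  int value;
};

static struct obj pool[POOL_SIZE];  /* preallocated up front */
static struct obj *free_list;

static void pool_init(void) {
  for (int i = 0; i < POOL_SIZE - 1; i++)
    pool[i].next_free = &pool[i + 1];
  pool[POOL_SIZE - 1].next_free = NULL;
  free_list = &pool[0];
}

static struct obj *pool_get(void) {  /* O(1), no malloc */
  struct obj *o = free_list;
  if (o)
    free_list = o->next_free;
  return o;                          /* NULL when the pool is exhausted */
}

static void pool_put(struct obj *o) {  /* O(1), no free */
  o->next_free = free_list;
  free_list = o;
}
```

The trade-off is a fixed capacity, but for hot paths with a known upper bound that is usually a price worth paying.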
My rule of thumb is to look at the application's design and only use malloc where there is a 1:N (or N:M) relationship between entities. Everything that's 1:1 should be allocated in a single step.
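One common way to apply that rule in C: when a header and its variable-length payload are strictly 1:1, use a flexible array member so both come from a single malloc. The names here are illustrative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct message {
  size_t len;
  char data[];  /* flexible array member: payload lives in the same
                   block as the header */
};

static struct message *message_new(const char *src, size_t len) {
  /* one malloc covers header + payload (+1 for a terminating NUL) */
  struct message *m = malloc(sizeof(*m) + len + 1);
  if (m) {
    m->len = len;
    memcpy(m->data, src, len);
    m->data[len] = '\0';
  }
  return m;
}
```

One allocation also means one free, which simplifies cleanup paths as a bonus.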
> Doing very small (less than say 32 bytes) allocations is also wasteful just due to the very large amount of data in proportion that will be used just to keep track of that tiny little memory area (within the malloc system). Not to mention fragmentation of the heap.<p>That's not necessarily true. Modern allocators tend to serve small requests from fixed-size buckets (size classes), which keeps the per-allocation bookkeeping overhead small.<p>But given that curl runs on lots of platforms, it makes sense to just fix the code.
Note that this pattern[0] is essentially "copy-on-write", which can be encapsulated safely as such in a reasonably simple type (in a language with generics) and used elsewhere. I use a similar mechanism pervasively in some low-level web server code to use references into the query string, body and JSON objects directly when possible, and allocated strings when not.<p>[0] <a href="https://github.com/curl/curl/commit/5f1163517e1597339d" rel="nofollow">https://github.com/curl/curl/commit/5f1163517e1597339d</a>
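The borrowed-or-owned pattern the comment describes can be sketched in C as well, without generics: keep a pointer into the caller's buffer when possible, and only allocate (and remember to free) when a private copy is needed. This is an illustrative sketch, not curl's actual type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct str_ref {
  const char *ptr;
  size_t len;
  bool owned;  /* true if ptr was malloc'd and must be freed */
};

/* Zero-copy: just reference the caller's buffer. */
static struct str_ref str_borrow(const char *s, size_t len) {
  struct str_ref r = { s, len, false };
  return r;
}

/* Copy on first need: upgrade a borrowed reference to an owned copy. */
static int str_make_owned(struct str_ref *r) {
  if (r->owned)
    return 0;
  char *copy = malloc(r->len + 1);
  if (!copy)
    return -1;
  memcpy(copy, r->ptr, r->len);
  copy[r->len] = '\0';
  r->ptr = copy;
  r->owned = true;
  return 0;
}

static void str_release(struct str_ref *r) {
  if (r->owned)
    free((void *)r->ptr);
}
```

The fast path (parsing a query string or header in place) never touches the allocator; only values that must outlive or differ from the source buffer pay for a copy.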
> The point is rather that curl now uses less CPU per byte transferred, which leaves more CPU over to the rest of the system to perform whatever it needs to do. Or to save battery if the device is a portable one.<p>Does anyone have a general sense of how these kinds of efficiencies translate to real-world battery life? I understand that the mechanisms (downclocking/sleeping the CPU) are there; I'm just curious as to how much it actually moves the needle in a real system.
> There have been 213 commits in the curl git repo from 7.53.1 till today. There’s a chance one or more other commits than just the pure alloc changes have made a performance impact, even if I can’t think of any.<p>"I can't think of any" is not a very scientific way to measure optimizations. In fact, this alone casts doubt on whether the malloc changes caused the speed-up, as opposed to any of the 200+ other commits the author is working on top of.<p>Why not eliminate that doubt by applying only the malloc optimizations to the previous official release?
I'm a bit skeptical about the speed up myself, since I would expect curl to be primarily IO bound and not CPU bound (much less malloc bound, given how little memory it uses).
> This time the generic linked list functions got converted to become malloc-less (the way linked list functions should behave, really).<p>I don't see how a linked list can not use malloc().
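One answer to the question above: the list code itself never allocates. The caller embeds the node in its own structure, which can live on the stack, in an array, or inside some other allocation, and the list functions only rewire pointers. This is an illustrative sketch, not curl's actual API.

```c
#include <assert.h>
#include <stddef.h>

struct list_node {
  struct list_node *next;
};

struct list {
  struct list_node *head;
};

static void list_push(struct list *l, struct list_node *n) {
  n->next = l->head;  /* no malloc anywhere in the list code */
  l->head = n;
}

struct item {
  struct list_node node;  /* embedded node, kept as the first member so a
                             node pointer can be cast back to the item */
  int value;
};
```

Whoever owns the `struct item` owns the node, so the list's lifetime is tied to the items themselves, not to the allocator.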