(I realise the poster came up with a different solution ultimately, but this bears repeating.)<p>Don't hand-roll standard library functions. You'll very likely regret it later.<p>In particular, as hardware advances, OS vendors often update their implementations with optimisations specific to new hardware platforms. That improves every program that consumes those interfaces, and in some cases prevents regressions when the hardware changes.<p>In addition, you really don't want to be the one who introduces a security bug into your program because your optimisation work didn't account for a scenario that the standard library function does.<p>I'm aware there may be cases where some believe it's appropriate; don't give in to temptation. Duplicating the standard library is a bad idea.
It's interesting that he points out that testing for zero is cheaper than comparing two numbers. Interesting because this might not always be the case.<p>For a quick test I used the conditions (i=0; i < NUMBER; i++) and (i=NUMBER; i != 0; i--). Without optimizations turned on, gcc translates both of them to a cmpl (compare long) followed by a conditional jump. So at the assembly level they're essentially identical, except that in one case you're comparing against zero and in the other against the number. Fine, so we'll have to dig deeper.<p>And this is where it gets complicated. This optimization depends entirely on the inner workings of the ALU. In theory a test against zero is cheaper because no arithmetic is needed at all: the hardware only has to check whether any bit of the value is set (or reuse the zero flag the preceding decrement already produced), whereas comparing two arbitrary numbers requires a subtraction and an inspection of the resulting flags. On real hardware, though, that subtraction may be parallelized inside the ALU, so a comparison of two numbers might take exactly as many clock cycles as the comparison against zero.<p>It's an interesting excursion into how such basic things work, though. And concerning my test: both loops are about equally fast on my machine. I've done multiple runs with no clear winner.
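For reference, a minimal sketch of the two loop shapes being compared (NUMBER and the loop bodies are placeholders; the volatile sink is only there to stop the compiler from deleting the loops entirely):<p><pre><code>  #define NUMBER 1000000

  /* Counting up: the condition compares i against NUMBER on every iteration. */
  void count_up(volatile int *sink) {
      for (int i = 0; i < NUMBER; i++)
          *sink = i;
  }

  /* Counting down: the condition only ever tests i against zero. */
  void count_down(volatile int *sink) {
      for (int i = NUMBER; i != 0; i--)
          *sink = i;
  }
</code></pre>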
Any more detailed information about the coding challenge?<p>For example, the repo references a 5Murls.txt file, but it isn't part of the repo, and the blog says it needs to "output in a standardized format", but doesn't specify what "standardized" means, nor does the current code actually output anything (the printf is disabled). Does it specifically need to go to stdout, or is it enough that it exists somewhere in memory? Does it require a char*, or will this char* be instantly hashed and tossed away? Does the challenge forbid the use of threads/cpus/workers? The blog says: "In this particular context the plugin architecture we were writing against allowed for returning the original string or a new malloc’d string". Are you allowed to muck with the original string? The code on GitHub has a function that takes a const char*, forcing a malloc, but it could easily just rewrite the string in place if allowed.<p>edit:
As for sorting the params (based on your comments about the common usage), I'm pretty sure there is a way to do this without any string comparisons at all. Post a sanitized 5Murls.txt file and I'll give it a go and put together a patch.
I made a varnish urlsort module a while ago:
<a href="https://github.com/cyberroadie/varnish-urlsort" rel="nofollow">https://github.com/cyberroadie/varnish-urlsort</a><p>Here my blog post about it:
<a href="http://cyberroadie.wordpress.com/2012/01/05/varnish-reordering-query-string/" rel="nofollow">http://cyberroadie.wordpress.com/2012/01/05/varnish-reorderi...</a><p>It parses the url and every token gets added to a binary tree which than gets traversed in order to get the parameters alphabetically<p>It's been used in production in several companies I worked for, so tried and tested :-)<p>Olivier
As a note to the author about separating the arrays instead of using a structure, it was probably an alignment issue? Check out <a href="http://en.wikipedia.org/wiki/Data_structure_alignment" rel="nofollow">http://en.wikipedia.org/wiki/Data_structure_alignment</a>
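A small illustration of how that padding can creep in when the fields are interleaved in one struct (numbers are typical for x86-64; the exact layout is implementation-defined):<p><pre><code>  #include <stdio.h>

  /* Interleaved: the compiler pads so each member sits on its natural
     alignment boundary (here, typically 7 bytes after the char). */
  struct interleaved {
      char   flag;
      double value;
  };

  /* Separate arrays: no per-element padding at all. */
  struct separate {
      char   flags[1000];
      double values[1000];
  };

  int main(void) {
      printf("one interleaved element: %zu\n", sizeof(struct interleaved));        /* often 16    */
      printf("1000 interleaved:        %zu\n", sizeof(struct interleaved) * 1000); /* often 16000 */
      printf("separate arrays:         %zu\n", sizeof(struct separate));           /* often 9000  */
      return 0;
  }
</code></pre>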
> Pointer arithmetic
I have always heard of pointer arithmetic but now I understand why it’s so useful. Taking a pointer to the first element in an array and incrementing it to iterate through its elements is elegant and more efficient than taking an integer i and using it as an array index. <i>I wish more languages afforded this.</i><p>Trust me, no you don't. Pointer arithmetic is one of the easiest things to screw up in C, and the raw form can be very dangerous to work with.<p>For sure, treating a pointer as an integer and incrementing by a fixed number of bytes each time is fine, but more often than not languages already implement this for you, just as a JIT or compiler optimization.
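For comparison, the two iteration styles being discussed; a modern compiler will usually emit identical code for both, which is part of the point:<p><pre><code>  #include <stddef.h>

  /* Index form: each access is a[i], computed from the base pointer. */
  long sum_indexed(const int *a, size_t n) {
      long total = 0;
      for (size_t i = 0; i < n; i++)
          total += a[i];
      return total;
  }

  /* Pointer form: walk a pointer from the first element to one past the last. */
  long sum_pointer(const int *a, size_t n) {
      long total = 0;
      for (const int *p = a, *end = a + n; p != end; p++)
          total += *p;
      return total;
  }
</code></pre>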
Good read. Reminds me of the oldie but goodie <a href="http://ridiculousfish.com/blog/posts/old-age-and-treachery.html" rel="nofollow">http://ridiculousfish.com/blog/posts/old-age-and-treachery.h...</a>
I'm trying to picture a non-synthetic program where normalizing a URL is a significant part of the inner loop and I'm drawing a blank. Can someone help me out?
Very interesting. The author at first seemed not that familiar with C, but in the end he got a good result.<p>It also depends on whether there's pre-validation (even with C you're never sure; for example, reading that he used strcpy makes my head go "security hole", but this is a proof of concept so it's OK).<p>The best thing to keep in mind is that "there is no such thing as the fastest code".<p>Edit: strcmp to strcpy (or any str C function)
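To make the strcpy concern concrete, a hedged sketch of a bounded alternative (copy_url and its parameters are illustrative, not from the post):<p><pre><code>  #include <stdio.h>

  /* Bounded copy: returns 0 on success, -1 if the source didn't fit.
     snprintf always NUL-terminates and never writes past dst_size bytes. */
  static int copy_url(char *dst, size_t dst_size, const char *src) {
      int n = snprintf(dst, dst_size, "%s", src);
      return (n >= 0 && (size_t)n < dst_size) ? 0 : -1;
  }
</code></pre>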
The hand-rolled strcmp almost certainly performed better than the native one merely because it can be inlined; consider using rep cmpsb or directly including another assembly version of strcmp (though since the strings are short, the latter's overhead might not be worth it). Ditto for memcpy.<p>Perhaps also consider using a radix sort?
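To make the inlining point concrete, a hypothetical sketch (str_compare_inline is an illustrative name, not from the repo): because the definition is visible at the call site, the compiler can inline it and avoid the library-call overhead entirely, which is probably where the hand-rolled version's win comes from.<p><pre><code>  /* Minimal byte-wise compare that the compiler is free to inline. */
  static inline int str_compare_inline(const char *a, const char *b) {
      while (*a && *a == *b) {
          a++;
          b++;
      }
      return (unsigned char)*a - (unsigned char)*b;
  }
</code></pre>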
Good work. One obvious speed-up is, if possible, removing the use of dynamic memory: simply supply a static output buffer to url_sort so that it doesn't have to allocate/free memory for the resulting URL.<p>That took my execution time for 5M lines of (not really representative, but it's all I could be bothered to sort out at 11pm):<p><pre><code> /test.php?b=b&c=bb&a=3&a=4
</code></pre>
down from 5.05s to 4.49s (so down 10%) on an old 1.5GHz Athlon Linux box that whirrs away in the corner of my flat. (Down to 2.81s when compiled with -O4.)<p>(I assume you're compiling with -O4, as that's the only way I could make it worthwhile using your str_compare() function over libc's strcmp().)<p>The only other thing I can think of that may speed it up more is a kind of binary search on the insertion, as that is currently just a flat loop that does a comparison against each param, and is therefore O(n).<p>In a pessimal case where you've already got 30 params and your 31st is going just before the tail (but not as the tail, so you can't shortcut straight to adding it as the tail), it'll do a whole bunch of string comparisons on entries 1, 2, 3, ..., 30 before finding the right place. It'd be better to do them on, say, 1, 15, 23, 27, 29, 30. (This will only be beneficial once you have more than a certain number of params; an O(log n) algorithm will hopefully outweigh the slight expense of having to compute the midpoints.)<p>But, given you've got to 2M urls/sec, do you really need to eke out any more performance? That's an awful lot of traffic if you think each HTTP request will be about 500 bytes (including TCP headers, HTTP headers, etc). 2M * 500 bytes * 8 bits = ~8Gbps, and that's just the incoming traffic from established connections.<p>For the binary search suggestion I get to point out the minor nitpicks:<p>* Not all C compilers accept C++ style comments (gcc is way too lenient by default)<p>* Not all C compilers accept local variables being defined mid-function<p>* unsigned long is not the same as size_t (fun fun fun when porting to different word sizes)<p>* If you ever have to port this code to a different CPU you may find yourself spending time adding lots of casts to (ptrdiff_t) to avoid it breaking due to unexpected sign extensions/etc.<p>These may seem OTT for a project like this, but it's not trivial when you've continued those practices for years (and new devs have copied the house style) and have a bunch of projects totalling 20MLOC+ that you then need to port to a new compiler on a new architecture (with a different word size), and the code is riddled with assumptions about the old compilers and architectures. Some things can be automated, but spending <i>days</i> moving variable definitions to the top of functions and fixing the types of variables that hold return values of strlen() in a seemingly endless list of thousands of files gets boring after a very short period. Me? Bitter?<p>"gcc -Wall -ansi -pedantic" and lint should be your friends.
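The binary-search insertion mentioned above could look roughly like this (a sketch only; find_insert_pos and params are illustrative names, not from the repo). The memmove to open up the slot is still O(n), but it replaces n string comparisons with log n of them plus one bulk move:<p><pre><code>  #include <string.h>

  /* Return the index at which `key` should be inserted among `count`
     already-sorted parameter strings, using O(log n) comparisons. */
  static size_t find_insert_pos(char **params, size_t count, const char *key) {
      size_t lo = 0, hi = count;
      while (lo < hi) {
          size_t mid = lo + (hi - lo) / 2;
          if (strcmp(params[mid], key) <= 0)
              lo = mid + 1;      /* key sorts after params[mid] */
          else
              hi = mid;          /* key sorts before params[mid] */
      }
      return lo;
  }
</code></pre>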
Your insertion sort scans downwards from the highest populated index, bumping elements up as it goes. Given that you have space on both sides, I'm surprised that you didn't start in the middle and pick a scan direction based on the comparison result there.
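A sketch of that idea, assuming the params live in the middle of an oversized array with free slots at both ends (names, sizes, and the missing bounds checks are all illustrative; it also assumes the buffer already holds at least one param, seeded near the middle of the array): compare against the middle element once, then scan and shift only the half the new key belongs in.<p><pre><code>  #include <string.h>

  #define CAP 64

  struct param_buf {
      char  *slots[CAP];
      size_t lo, hi;              /* occupied range is [lo, hi) */
  };

  static void insert_param(struct param_buf *buf, char *key) {
      size_t mid = buf->lo + (buf->hi - buf->lo) / 2;

      if (strcmp(key, buf->slots[mid]) < 0) {
          /* Lower half: find the slot, then bump [lo, pos) down one place. */
          size_t pos = buf->lo;
          while (pos < mid && strcmp(buf->slots[pos], key) < 0)
              pos++;
          memmove(&buf->slots[buf->lo - 1], &buf->slots[buf->lo],
                  (pos - buf->lo) * sizeof buf->slots[0]);
          buf->lo--;
          buf->slots[pos - 1] = key;
      } else {
          /* Upper half: find the slot, then bump [pos, hi) up one place. */
          size_t pos = mid + 1;
          while (pos < buf->hi && strcmp(buf->slots[pos], key) < 0)
              pos++;
          memmove(&buf->slots[pos + 1], &buf->slots[pos],
                  (buf->hi - pos) * sizeof buf->slots[0]);
          buf->hi++;
          buf->slots[pos] = key;
      }
  }
</code></pre>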
We're getting ready to deploy Varnish at work (non-devops). Not having control over production, plus the potential for inline C in the config, was already making me a bit nervous. This thread is not helping!
I'd never come across Duff's Device previously, and after staring at it for a few minutes it made horrifying sense. I can't tell whether it's a piece of true evil or true genius.
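For anyone else who hasn't seen it, this is roughly the canonical form (copying `count` shorts to a memory-mapped output register, which is why `to` is never incremented). The switch jumps into the middle of the unrolled do/while body to handle the remainder:<p><pre><code>  /* Assumes count > 0. */
  void send(short *to, short *from, int count) {
      int n = (count + 7) / 8;
      switch (count % 8) {
      case 0: do { *to = *from++;
      case 7:      *to = *from++;
      case 6:      *to = *from++;
      case 5:      *to = *from++;
      case 4:      *to = *from++;
      case 3:      *to = *from++;
      case 2:      *to = *from++;
      case 1:      *to = *from++;
              } while (--n > 0);
      }
  }
</code></pre>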