Optimizing Haskell for performance can be tricky. When I wrote a functional image library (updated at [0]) I tried to get the frame time down with strictness, unpacked records, and so on, but one of the changes that made the biggest difference was switching from round to truncate[1]: I went from 120 ms/frame to 65 ms/frame! The reason is that round has quite a bit of logic associated with it[2] in Haskell's standard library, whereas truncate is a simple one-liner.<p>It shows that even in functions as simple as these, the underlying implementation details can have a real impact on performance.<p>Also, for some reason filling a C array from Haskell was faster than doing it in C, probably due to the cost of switching between the C code and the Haskell RTS.<p>[0] <a href="https://github.com/siraben/functional-images" rel="nofollow">https://github.com/siraben/functional-images</a><p>[1] <a href="https://github.com/sbond75/OpenGLTesting/commit/116384edd6a36d7cb828310fcc8bfbf6327cc47d" rel="nofollow">https://github.com/sbond75/OpenGLTesting/commit/116384edd6a3...</a><p>[2] <a href="https://hackage.haskell.org/package/base-4.14.0.0/docs/src/GHC.Real.html#truncate" rel="nofollow">https://hackage.haskell.org/package/base-4.14.0.0/docs/src/G...</a>
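To see the gap in isolation, here's a minimal criterion sketch of my own (not from the repo above) that compares the two conversions on the same data; only criterion's defaultMain, bench, and nf are assumed:<p>
    import Criterion.Main (bench, defaultMain, nf)

    -- Sample inputs; any list of Doubles will do.
    xs :: [Double]
    xs = [0.1, 0.2 .. 10000]

    main :: IO ()
    main = defaultMain
      [ bench "round"    $ nf (map (round    :: Double -> Int)) xs
      , bench "truncate" $ nf (map (truncate :: Double -> Int)) xs
      ]
<p>On Double -> Int, truncate is essentially a direct machine conversion, while the default round also has to decide the halfway cases (round-half-to-even), which is presumably where the extra time goes.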
I find this article a nice counterpoint to “it is difficult to reason about performance in Haskell”. A person who had never programmed in the language found all the tools and resources to implement an algorithm from a reference and make it fast.
GHC has to walk a fine line between A) applying enough optimization passes to reliably remove abstraction overhead and B) compiling fast enough that people won't complain about it.<p>It's pretty awesome what GHC can pull off in terms of reducing complex programs to a minimal normal form (you can see this with e.g. <a href="https://hackage.haskell.org/package/ghc-proofs-0.1.1/docs/GHC-Proof.html" rel="nofollow">https://hackage.haskell.org/package/ghc-proofs-0.1.1/docs/GH...</a> ), but oftentimes you do have to convince the optimizer to be more aggressive than it wants to be.
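As a generic sketch of what "convincing the optimizer" looks like in practice (not tied to the article's code), pragmas like INLINABLE and SPECIALIZE are the usual levers for getting GHC to remove dictionary passing from a polymorphic helper:<p>
    -- Without help, calls to this go through a Num dictionary at runtime.
    dot :: Num a => [a] -> [a] -> a
    dot xs ys = sum (zipWith (*) xs ys)
    {-# INLINABLE dot #-}  -- keep the unfolding available to other modules
    {-# SPECIALIZE dot :: [Double] -> [Double] -> Double #-}

    main :: IO ()
    main = print (dot [1 .. 1000] [1 .. 1000 :: Double])
<p>The SPECIALIZE pragma makes GHC compile a monomorphic Double copy, which can then unbox the arithmetic; whether the generic version gets the same treatment is up to the optimizer.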
I'm not sure it was his first Haskell program; if it was, it's pretty impressive that he used INLINE, understood the minute differences between random libraries, and reached for profiling, ThreadScope, etc.
The hardest thing about optimizing is knowing whether something is fast. It helps if you have another version of the same thing that is already fast, or seems so. When you match it, are you now fast?<p>You are until somebody makes one faster. Then, you're slow again. That is our deal with the Devil: our machines are filled with dodgy gimcracks that often make our programs faster, but we can hardly ever know whether a program is as fast as it could be, or if some gimcrack that would make it faster hasn't engaged yet; or, often, even how to go about engaging it.<p>Usually we stop optimizing when it seems fast enough, usually because something else is (still) slower.<p>It turns out to be pretty simple to make quicksort go twice as fast on current hardware. Standard libraries and compilers generally don't do those things, often out of fear of making a few programs slower than before. People complain a lot more about a slower program than they thank you for making a hundred others faster--if they even notice the latter.
It's interesting that halfway through the optimizations Haskell is doing 3.3 terabytes of total heap allocation.<p>When it gets to 93 seconds it is doing 208 GB of heap allocation, and in the final version it is still doing over 20 GB of heap allocation. This seems like quite a lot for a 384x216 image.
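For anyone wondering where numbers like that come from: running the binary with +RTS -s prints total bytes allocated, and the same counters are readable from inside the program via GHC.Stats. A minimal sketch (it assumes the program was started with +RTS -T, otherwise getRTSStats throws):<p>
    import GHC.Stats (allocated_bytes, getRTSStats)

    main :: IO ()
    main = do
      -- ... the actual rendering work would go here ...
      stats <- getRTSStats  -- requires running the program with +RTS -T
      putStrLn ("Total bytes allocated: " ++ show (allocated_bytes stats))
<p>Worth keeping in mind that these totals are cumulative allocation, not peak residency: short-lived objects that the nursery reclaims almost for free still count, so 20 GB allocated doesn't mean 20 GB live at once.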
That's really cool. The next step would be using explicit SIMD instructions. Is there a SIMD library for Haskell? Or a library for automatically doing AoS->SoA transformation on some compute code?