Another resource on the same topic: <a href="https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-using-x86inc-asm/" rel="nofollow">https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-...</a><p>As I'm seeing in the comments here, opinions on the usefulness of handwritten SIMD range from "totally unclear" to "mission critical". I'm seeing a lot on the "totally unclear" side, but not as much on the "mission critical" side, so I'll talk a bit about that.<p>FFmpeg is a pretty clear use case because of how often it is used, but I think it is easier to quantify the impact of handwriting SIMD with something like dav1d, the universal production AV1 video decoder.<p>dav1d is used pretty much everywhere, from major browsers to the Android operating system (superseding libgav1). A massive element of dav1d's success is its incredible speed, which is largely due to how much of the codebase is handwritten SIMD.<p>While I think it is a good thing that languages like Zig have built-in SIMD support, there are some use cases where it becomes necessary to do things by hand because even a potential performance delta is important to investigate. There are lines of code in dav1d that will be run trillions of times in a single day, and they need to be as fast as possible. The difference between handwritten and compiler-generated SIMD can be up to 50% in some cases, so it is important.<p>I happen to be somewhat involved in similar use cases, where things I write will run a lot of times. To make sure these skills stay alive, resources like the FFmpeg school of assembly language are pretty important, in my opinion.
I used to write quite a few SIMD versions of critical functions, but now I rarely do. One thing to try is to isolate that code and run it in the Most Excellent Compiler Explorer [0].<p>And stare at the generated code!<p>More often than not, auto-vectorisation now generates a pretty excellent SIMD version of your function, and all you have to do is 'hint' the compiler: for example, explicitly state alignment, or provide your own vector source/destination type. You can do a lot by 'styling' your C code while thinking about what the compiler might be able to do with it: for example, use extra intermediary variables, really break down all the operations you want, etc.<p>Worst case, if the compiler REALLY isn't clever enough, this gives you a good base of generated assembly to adapt and tweak, without having to actually write the boilerplate bits.<p>In most cases, the resulting C function will be vectorized as well as, or better than, the hand-coded one I'd write, and in many other cases it's "close enough" not to matter that much. The other good news is that the code will probably vectorize fine for WASM, NEON, etc. without needing explicit versions.<p>[0] <a href="https://godbolt.org/" rel="nofollow">https://godbolt.org/</a>
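To make the 'hinting' idea above a bit more concrete, here is a minimal sketch of the kind of C styling being described. The function name `scale_add`, the 32-byte alignment and the operation itself are all made up for illustration; `__builtin_assume_aligned` is a GCC/Clang builtin, so this assumes one of those compilers. Whether a given compiler actually vectorizes it is something to check in Compiler Explorer, not something this snippet guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: dst[i] = a[i] * gain + b[i].
 * `restrict` promises the buffers do not overlap, which removes the
 * aliasing checks that often block auto-vectorisation. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float gain, size_t n)
{
    /* The caller is assumed to pass 32-byte aligned buffers;
     * verify that promise in debug builds. */
    assert((uintptr_t)dst % 32 == 0);
    assert((uintptr_t)a % 32 == 0);
    assert((uintptr_t)b % 32 == 0);

#if defined(__GNUC__) || defined(__clang__)
    /* Explicit alignment hint (GCC/Clang only). */
    dst = __builtin_assume_aligned(dst, 32);
    a = __builtin_assume_aligned(a, 32);
    b = __builtin_assume_aligned(b, 32);
#endif

    for (size_t i = 0; i < n; i++) {
        /* Intermediate variable: one simple operation per statement. */
        float scaled = a[i] * gain;
        dst[i] = scaled + b[i];
    }
}
```

Building with `-O3` and comparing the output with and without the `restrict` qualifiers and alignment hints is a quick way to see how much each hint matters for a particular compiler and target.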
I'm curious to hear from anyone who has done it: is there any "pleasure" to be had in learning or implementing assembly (like there is for LISP or RISC-V), or is it something you learn and implement because you want to do something else (like learning COBOL if you need to work with certain kinds of systems)? It has always piqued my interest, but I don't have a good reason in my day-to-day job to get into it. Wondering if it is worth committing some time to for the fun of it.
I personally don't think there's much value in writing assembly (vs using intrinsics), but it's been really helpful to read it. I have often used Compiler Explorer (<a href="https://godbolt.org/" rel="nofollow">https://godbolt.org/</a>) to look at the assembly generated and understand optimizations that compilers perform when optimizing for performance.
Kudos for the K&R reference! That was the book I bought to learn C and programming in general. I had initially tried C++ as my first language but I found it too abstract to learn because I kept asking what was going on underneath the hood.
This is perfect. I used to know x86 assembly back in the 386 days, but it got too complex for me on more advanced processors. I'd definitely like to learn more about SIMD on recent CPUs, so this seems like a great resource.
> Note that the “q” suffix refers to the size of the pointer *(*i.e in C it represents *sizeof(*src) == 8 on 64-bit systems, and x86asm is smart enough to use 32-bit on 32-bit systems) but the underlying load is 128-bit.<p>I find that sentence confusing.<p>I assume that i.e is supposed to be i.e., but what is *(* supposed to mean? Shouldn't that be just an open parenthesis?<p>In what context would *sizeof(*src) be considered valid? As far as I know, sizeof never yields a pointer.<p>I get the impression that someone sprinkled random asterisks in that sentence, or maybe tried to mix asterisks-denoting-italics with C syntax.
Asm is 10x faster than C? That was definitely true at some point but is it still true today? Have compilers really stagnated so badly they can't come close to hand coded asm?
"Assembly language of FFmpeg" leads me to think of -filter_complex. It's not for human consumption even once you know many of its gotchas (-ss and keyframes, PTS, labeling and using chain outputs, fading, mixing resolutions etc).<p>But then again no-one is adjusting timestamps manually in batch scripts, so a high-level script on top of filter_complex doesn't have much purpose.
I'm halfway through this tutorial and I'm really enjoying it. I haven't touched assembly since back in university decades ago. I've always had an urge to optimize processes for some reason. This scratches that itch. I was also more curious about SIMD since hearing about it on Digital Foundry.
It doesn't mention the downsides of using assembly, the biggest of which is that your code is architecture-specific. For example, you have to write different code for x86 and ARM, and possibly even separate code for 32-bit x86 and x86_64. Unfortunately, there isn't really a great way to write portable SIMD code, at least in C. Rust is working on stabilizing a portable SIMD API, and Zig has SIMD support, but I suspect FFmpeg would still complain they aren't quite as fast as they would like.<p>One thing that confuses me is the opposition to inline asm. It seems like inline asm would be more efficient than having to make a function call to an asm function.
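For anyone who hasn't run into the portability problem first-hand, here is a rough sketch of what it looks like even with intrinsics rather than raw assembly. The function `add_sat_u8` is a made-up example, not something from FFmpeg; the intrinsics are the standard SSE2 ones from `<emmintrin.h>` and NEON ones from `<arm_neon.h>`. The same 16-byte saturating add has to be spelled differently per architecture, plus a scalar fallback for everything else.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* x86 SSE2 intrinsics */
#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* ARM NEON intrinsics */
#endif

/* Hypothetical helper: dst[i] = min(a[i] + b[i], 255) for n bytes. */
void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;

#if defined(__SSE2__)
    /* x86 path: 16 bytes per iteration, unaligned loads and stores. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#elif defined(__ARM_NEON)
    /* ARM path: same operation, entirely different spelling. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqaddq_u8(va, vb));
    }
#endif

    /* Scalar code: handles the tail, or the whole buffer on other targets. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

Every additional target or instruction set extension means another branch like these, which is the maintenance cost being pointed out above.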
SIMD was introduced in the 80s but became ubiquitous when Intel got in on it in the 90s. It's interesting that (for x86), PLT is still stuck at hand-writing assembly 40 years later.
Uhmmm...Lots of praise but these are just three small lessons covering basics. Exercises not uploaded yet. Looks like a work in progress or in the beginning?
I'm kind of stunned we haven't gotten something better / more Rust-based than FFmpeg.<p>Especially curious given the advent of Apple Metal etc.<p>Does anyone have recommendations?
I'll be honest, I didn't read through much. FFmpeg gives me severe PTSD. My first task out of college was to write a procedurally generated video using FFmpeg, conform to DASH, and get it under 150kb/s while being readable. Docs were unusable. DASH was only a few months old. And Stack Overflow was devoid of help. I kid you not, the only way to get any insight was some sketchy IRC channel. (2016 btw, well past IRC's prime)