According to DJB, what makes it so fast is deceptive benchmarks: <a href="https://cr.yp.to/djbfft/bench-notes.html" rel="nofollow">https://cr.yp.to/djbfft/bench-notes.html</a><p>Maybe something has changed in the two decades since then? Their benchmark page does not seem to have changed much in at least a decade.<p>I would guess that folks using GPUs could get quite a speedup on this type of numerical workload in 2018.
I don't know anything about FFTW, but the question/answer seem misleading:<p>GitHub language details of the <i>non-generated</i> sources say 75% C: <a href="https://github.com/FFTW/fftw3" rel="nofollow">https://github.com/FFTW/fftw3</a><p>FFTW author comment linked from Quora answer contradicts said answer (~2/3 C): <a href="https://groups.google.com/d/msg/fa.caml/B5kFMTl67MU/9c8swiOE0M8J" rel="nofollow">https://groups.google.com/d/msg/fa.caml/B5kFMTl67MU/9c8swiOE...</a><p>Am I missing something here? Double generation?
Since the linked explanation assumes you know what it is and what it stands for: Fastest Fourier Transform in the West, <a href="https://en.wikipedia.org/wiki/FFTW" rel="nofollow">https://en.wikipedia.org/wiki/FFTW</a>
FFTW certainly isn't simply written in OCaml. Apart from the higher-level C, at its base it has the sort of low-level, micro-architecture-specific kernels you'd expect (using C intrinsics rather than assembler in this case).<p>[There was useful advice in <a href="https://stackoverflow.com/a/3058546" rel="nofollow">https://stackoverflow.com/a/3058546</a> for example, and doubtless more recently, concerning trolling.]
The problem I've always had with FFTs is that it's incredibly difficult to write an optimal FFT for size X, direction Y, and variability Z (X = 2^4 to 2^64; Y = FWD, REV, BIDIR; Z = 2^4 to 2^(log2(X))).<p>Does the metaprogramming element of FFTW work well, or does it boil down to "If X=2^4, Y=FWD, Z=X: build24FWDX()"?
> run-time profiling to choose the fastest codelets (which is largely affected by the target architecture).<p>What's the quick overview of how this run-time profiling works?<p>The value of such a system seems immense, but I have only seen it mentioned in reference to FFTW. Have there been efforts to generalize that process to real-time DSP programming?
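For a rough idea of what "run-time profiling to choose the fastest" can look like, here is a minimal sketch in Python. It is not FFTW's planner or API: the names `CANDIDATES`, `make_plan` and `_time_once` are made up for illustration, and the two candidates (a direct DFT and a recursive radix-2 FFT) stand in for real codelets. The only part taken from FFTW is the general idea that, when asked to measure (for example with the `FFTW_MEASURE` flag), the planner times candidate strategies on the actual machine and keeps the fastest one for later reuse.

```python
import cmath
import time


def dft_naive(x):
    """Direct O(n^2) DFT; works for any length."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    half = n // 2
    # Twiddle factors applied to the odd-indexed half.
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(half)]
    return [even[k] + tw[k] for k in range(half)] + \
           [even[k] - tw[k] for k in range(half)]


# Hypothetical candidate table; a real planner would have many more entries.
CANDIDATES = {"naive": dft_naive, "radix2": fft_radix2}


def _time_once(fn, data):
    """Wall-clock time of a single call to fn(data)."""
    start = time.perf_counter()
    fn(data)
    return time.perf_counter() - start


def make_plan(n, repeats=3):
    """Time every candidate on a dummy input of length n (a power of two)
    and return (name, function) for the fastest one on this machine."""
    dummy = [complex(i % 7, 0) for i in range(n)]
    best_name, best_time = None, float("inf")
    for name, fn in CANDIDATES.items():
        # Take the minimum over a few runs to reduce timing noise.
        elapsed = min(_time_once(fn, dummy) for _ in range(repeats))
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, CANDIDATES[best_name]


if __name__ == "__main__":
    name, plan = make_plan(256)   # pay the measurement cost once
    print("chosen:", name)
    spectrum = plan([complex(i, 0) for i in range(256)])  # reuse many times
```

The point of the design is that the measurement cost is paid once per transform size, and the resulting plan is then executed many times. Which candidate wins depends on the machine and the size, so the sketch makes no claim about the output.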