Optimization is a bit different for an FFT running on a CPU than for an FPGA or other hardware implementation.<p>If you’re making a digital filter on an FPGA, you are going to structure it as a DIF FFT, which produces out-of-order results, followed by a DIT iFFT, which accepts that out-of-order data. The arithmetic irregularities the article mentions for the DIF and split-radix structures don't factor in the same way when you control the hardware: the complex multiply is implemented with 3 multiplies and 5 adds, and twiddles are better computed on the fly than stored, which wastes transistors.
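For anyone who hasn't seen the 3-multiply complex product, here is a minimal sketch of one common arrangement. The function name `cmul3` and the intermediate names are mine, and real hardware would map this onto DSP slices rather than Python floats.

```python
def cmul3(a, b, c, d):
    """Return (a + bi) * (c + di) as (real, imag).

    Uses 3 real multiplies and 5 real adds/subtracts instead of the
    textbook 4 multiplies and 2 adds.
    """
    k1 = c * (a + b)   # 1 add, 1 multiply
    k2 = a * (d - c)   # 1 subtract, 1 multiply
    k3 = b * (c + d)   # 1 add, 1 multiply
    real = k1 - k3     # equals a*c - b*d
    imag = k1 + k2     # equals a*d + b*c
    return real, imag
```

When (c + di) is a known twiddle, `d - c` and `c + d` depend only on the twiddle, so they could be produced alongside it, leaving 3 adds in the data path.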
I remember attending a conference presentation back in the early 2000s by one of the FFTW authors. They claimed that the mapping between architecture and optimal FFT algorithm was complex enough that the only sensible approach was to implement several and empirically determine the best one at runtime.
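A toy sketch of that "time the candidates and keep the winner" idea, in case it helps. This is not FFTW's planner or its API: `plan`, `CANDIDATES`, and the two transforms are hypothetical stand-ins, and the recursive radix-2 routine assumes a power-of-two length.

```python
import cmath
import time

def dft_naive(x):
    """Direct O(n^2) DFT, used here as one candidate."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_radix2(x):
    """Recursive radix-2 DIT FFT; assumes len(x) is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

CANDIDATES = {"naive": dft_naive, "radix2": fft_radix2}

def plan(n, repeats=3):
    """Time every candidate on a length-n sample and return (name, function)
    for the fastest one measured on this machine."""
    sample = [complex(i % 7, -(i % 3)) for i in range(n)]
    best_name, best_time = None, float("inf")
    for name, fn in CANDIDATES.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(sample)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, CANDIDATES[best_name]
```

You would call `plan(n)` once per size, cache the result, and reuse the returned function for every transform of that size.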
I wrote an optimized FFT for fun a while ago and a lot of this is quite familiar. Optimized FFTs are a fascinating field with a long history. I wouldn't recommend writing one from scratch for production instead of using an existing library, but it's a good exercise.<p>Using a real-to-complex FFT makes a big difference for performance and is important to start with, as it places some additional constraints on the main FFT. In particular, the butterfly needed in the r2c and c2r passes isn't very amenable to working in bit-reversed order, so the trick of processing the frequency domain in bit-reversed order doesn't necessarily work. It's also important for comparison against the Fast Hartley Transform, which looks good performance-wise against a complex FFT but not against a real FFT.<p>I also found that radix-4 performed better than split-radix or conjugate-pair FFT with SSE2/AVX SIMD. Both the instruction flow and the data flow are cleaner, and the CPU has an easier time flooding the FMA units with simple loops than with the more chaotic data flow of SRFFT/CPFFT. An FMA-based radix-4 loop can easily keep the FMA units at >95% utilization.<p>For data ordering, the vector-interleaved format mentioned is indeed great for the main passes, but real/imag interleaved turns out to have some benefits for the smallest butterflies. What worked best in my case was to do the deinterleave/transpose as part of an initial radix-8 pass that also handled the bit reversal a cache line at a time.
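To make the r2c point concrete, here is a rough sketch of the usual "pack a real signal into a half-length complex FFT" approach. It is not the code described above: `rfft_via_half_complex` is a name made up for illustration, the naive `dft` is only a stand-in for an optimized complex FFT, and the input length is assumed to be even.

```python
import cmath

def dft(x):
    """Naive complex DFT; stand-in for an optimized complex FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def rfft_via_half_complex(x):
    """Bins 0..n/2 of the DFT of a real sequence x (len(x) must be even),
    computed with one complex transform of length n/2."""
    n = len(x)
    m = n // 2
    # Pack even samples into the real part and odd samples into the imaginary part.
    z = [complex(x[2 * i], x[2 * i + 1]) for i in range(m)]
    Z = dft(z)
    X = []
    for k in range(m):
        zk = Z[k]
        zc = Z[(m - k) % m].conjugate()   # the r2c butterfly pairs bin k with bin m-k
        e = (zk + zc) / 2                 # DFT of the even-indexed samples
        o = (zk - zc) / 2j                # DFT of the odd-indexed samples
        X.append(e + cmath.exp(-2j * cmath.pi * k / n) * o)
    # Nyquist bin: twiddle is -1, and E[0], O[0] are the real and imaginary parts of Z[0].
    X.append(complex(Z[0].real - Z[0].imag, 0.0))
    return X
```

The pairing of bin k with bin m-k is the awkward part: in natural order those two bins mirror each other around the middle of the array, but after a bit-reversal permutation they land at unrelated positions.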