Wow, the training time difference is much larger than I thought it would be. The main reason I use Flux.jl is its flexibility to just throw existing Julia libraries into it (DiffEqFlux.jl for neural ODEs, neural SDEs, neural PDEs, neural jump diffusions, etc. took surprisingly little work). However, I assumed that the CPU kernels would all be relatively comparable between the different neural network frameworks. This is quite a compelling example that, at least for small neural networks and little data, the per-call overhead can be quite large (as in 10x!).

As the neural networks grow, this difference will fade away because more of the time is spent in the BLAS kernels. However, for non-standard applications that don't spend most of their time inside the neural network (like many neural differential equations, according to our profiling), this difference would make or break an application.
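
To give a sense of what that composability looks like, here is a minimal sketch of wrapping a small Flux Chain as a neural ODE. It assumes a DiffEqFlux.jl version whose NeuralODE layer accepts a Flux Chain directly; the layer sizes, solver, and time span are just illustrative:

    using Flux, DiffEqFlux, OrdinaryDiffEq

    # A small Flux network defining the ODE right-hand side du/dt = f(u).
    dudt = Chain(Dense(2, 16, tanh), Dense(16, 2))

    # Wrap it as a neural ODE solved with Tsit5 over t in [0, 1.5].
    tspan = (0.0f0, 1.5f0)
    node = NeuralODE(dudt, tspan, Tsit5(), saveat = 0.1f0)

    # Calling it solves the ODE from an initial condition u0; the solve is
    # differentiable, so the parameters node.p can be trained with Flux.
    u0 = Float32[2.0, 0.0]
    sol = node(u0)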