Here are a few ideas:<p>Use an extensible compiler and targeted optimizations. <a href="https://mitpress.mit.edu/books/automatic-algorithm-recognition-and-replacement" rel="nofollow">https://mitpress.mit.edu/books/automatic-algorithm-recogniti...</a> is an excellent book on this topic.<p>Use a cluster to evolve the best settings for compile options, executable layout, instruction scheduling, etc. There is a paper from a Google author about doing this for prefetching.<p>Use an ILP solver for register allocation, instruction scheduling and other problems that are normally solved with heuristics. The size of the program may make this intractable. There was a startup that used this approach for a custom programming language targeted at Intel's network processors.
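<p>To make the "evolve the best settings" idea concrete, here is a minimal sketch of an evolutionary search over on/off compiler flags. It is an illustration under stated assumptions, not the method from the paper mentioned above: the flag list is just an example set of GCC options, and `synthetic_cost` with its `ASSUMED_EFFECT` numbers is a made-up stand-in for "compile with these flags, run the benchmark, return the wall time". On a real cluster, each cost evaluation would be a build-and-measure job farmed out to a worker.

```python
import random

# Example GCC flags to toggle; a real run would use whatever the compiler supports.
FLAGS = ["-funroll-loops", "-fomit-frame-pointer",
         "-ftree-vectorize", "-finline-functions"]

# Made-up per-flag effect in seconds (negative = faster). Purely illustrative.
ASSUMED_EFFECT = [-1.5, -0.5, -2.0, 0.7]


def synthetic_cost(genome):
    """Hypothetical placeholder for: build with the enabled flags, run, time it."""
    return 10.0 + sum(e for on, e in zip(genome, ASSUMED_EFFECT) if on)


def mutate(genome, rng, rate=0.25):
    # Flip each flag independently with probability `rate`.
    return tuple((not g) if rng.random() < rate else g for g in genome)


def crossover(a, b, rng):
    # Uniform crossover: take each flag from one parent at random.
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolve(cost, n_flags, generations=30, pop_size=8, seed=0):
    rng = random.Random(seed)
    baseline = (False,) * n_flags  # all flags off, kept as a reference point
    population = [baseline] + [
        tuple(rng.random() < 0.5 for _ in range(n_flags))
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: pop_size // 2]  # elitism: best half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return min(population, key=cost)


def enabled_flags(genome):
    return [flag for flag, on in zip(FLAGS, genome) if on]


if __name__ == "__main__":
    best = evolve(synthetic_cost, len(FLAGS))
    print(enabled_flags(best), synthetic_cost(best))
```

<p>Because the best half of each generation is carried over unchanged, the result can never be worse than the all-flags-off baseline. Real measurements are noisy, so in practice each configuration would probably need several timed runs before comparing costs.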
(Assuming we're not talking about distributed solutions.)<p>The fastest way would be to create an ASIC: hardware designed to run your algorithm specifically.<p>Something simpler and a bit slower would be an FPGA.<p>Below that is a GPU implementation of your code, assuming it can be parallelized.<p>Below that is hand-crafted assembly by someone who is smarter than a good compiler.<p>Below that is hand-crafted C/C++ or Fortran code.<p>Here are benchmarks of various languages for various problems:<p>N-body: <a href="https://benchmarksgame.alioth.debian.org/u64q/performance.php?test=nbody" rel="nofollow">https://benchmarksgame.alioth.debian.org/u64q/performance.ph...</a><p>Spectral-norm: <a href="http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=spectralnorm" rel="nofollow">http://benchmarksgame.alioth.debian.org/u64q/performance.php...</a><p>Digits of pi: <a href="http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=pidigits" rel="nofollow">http://benchmarksgame.alioth.debian.org/u64q/performance.php...</a><p>FASTA: <a href="http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=fasta" rel="nofollow">http://benchmarksgame.alioth.debian.org/u64q/performance.php...</a>