I may have misunderstood, but I believe step 1 (eliding loads) is simply a cache eviction problem. The optimal solution is the greedy "furthest in the future" eviction policy (Bélády's algorithm).
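To make the claim concrete, here is a minimal sketch of the "furthest in the future" policy, framed as deciding which value to drop from a fixed-size set of registers. The function name and the toy access sequence are my own; this is not code from the compiler under discussion.

```python
def belady_misses(accesses, capacity):
    """Count misses (reloads) for an optimal cache of the given capacity,
    evicting the resident value whose next use is furthest in the future."""
    cache = set()
    misses = 0
    for i, value in enumerate(accesses):
        if value in cache:
            continue  # already resident, no load needed
        misses += 1
        if len(cache) >= capacity:
            # Next-use distance; values never used again sort last.
            def next_use(v):
                for j in range(i + 1, len(accesses)):
                    if accesses[j] == v:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(value)
    return misses

print(belady_misses("abcabdca", capacity=2))
```

The policy is offline (it needs the whole future access sequence), which is exactly why it fits a compiler: unlike a hardware cache, the compiler can see every future use of a value before deciding what to spill.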
I'd like to see this same technique applied to x86, and what the performance is like without the "illegal instructions" (omitting them from generation would probably be trivial). It's relatively well known that one of the ways Asm programmers can beat compilers is on instruction selection, and that's what this technique seems to excel at.
This does far more than I did back in my PET and then Amiga days, but one thing I did was write a multi-pass compiler. Each pass found ways to make the code better (usually smaller, so it ran faster). Even simple code I wrote could see a 10-20% improvement.

Of course, this is because the original code was quick and dirty. I wonder what improvement modern compilers could have added.
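The multi-pass approach described above can be sketched as a peephole optimizer that re-runs until a pass finds nothing left to shrink. The rewrite rules and the tuple-based instruction encoding here are hypothetical examples, not the commenter's actual passes.

```python
def peephole_pass(code):
    """One pass over a list of (mnemonic, *operands) tuples,
    applying simple local rewrite rules."""
    out, i = [], 0
    while i < len(code):
        # Rule 1: a store followed by a load of the same address is
        # redundant -- the value is already in the accumulator.
        if (i + 1 < len(code)
                and code[i][0] == "STA" and code[i + 1][0] == "LDA"
                and code[i][1] == code[i + 1][1]):
            out.append(code[i])
            i += 2
            continue
        # Rule 2: TAX immediately followed by TXA is a no-op round trip.
        if (i + 1 < len(code)
                and code[i] == ("TAX",) and code[i + 1] == ("TXA",)):
            out.append(code[i])
            i += 2
            continue
        out.append(code[i])
        i += 1
    return out

def optimize(code):
    """Repeat passes until a fixed point, as in a multi-pass compiler."""
    while True:
        new = peephole_pass(code)
        if new == code:
            return code
        code = new
```

Running passes to a fixed point matters because one rule's output can expose another rule's pattern; a single pass would miss those second-order shrinks.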
All the other compilers in the comparison are C compilers, right? Whereas this compiler is compiling its own home-made language? So I'm not sure how the comparison can be valid.
Bookmarked this to read about optimizers, because it looks great.

That said, I clicked on the link because it had "6502" in the title. And... this isn't very interesting as a retrocomputing activity. To be blunt: there's absolutely no way in hell a compiler architecture like that is ever going to be self-hosting in 64k of memory space.