Regarding the opcode dispatch: setting up the RTS this way is quite expensive, and (if you've got the room) you could be better off assembling a little thunk somewhere in memory: the bytes 4C 00 >SET (i.e. JMP >SET*256). You'd do this once, on startup.<p>A JMP to this thunk costs 3 cycles, and the JMP in the thunk costs 3 cycles, so that's 6 in total, which buys you nothing compared to the 6-cycle RTS. And the STx to set up the low byte takes 3 cycles (zero page) or 4 cycles (elsewhere), which is the same as or worse than the 3-cycle PHA. But because the high byte is always set up, you save the 5 cycles spent setting that up on every dispatch (the LDA #/PHA pair).<p>(If you're running from RAM, you don't even need the thunk: you can patch the operand of a JMP in the dispatch code itself.)<p>(Also: the opcode dispatch's EOR trick is space-efficient, but it takes an extra cycle or two - for one fewer byte, I won't deny - compared to doing a TAY after fetching the byte, then a TYA:AND #$F0 later. That sequence takes 6 cycles, whereas the LSR:EOR (R15L),Y sequence takes 7 or 8.)
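For concreteness, here is a rough sketch of the thunk idea in generic 6502 assembly. The labels THUNK, OPTBL, INIT and DISP are made up for illustration, and it assumes every handler lives in the same page as SET, X already holds the table index, and the table holds each handler's true low byte (the RTS approach needs address minus one instead).

```asm
; Sketch only: THUNK, OPTBL, INIT and DISP are assumed names, not from any
; real listing. All handlers are assumed to sit in the same page as SET.

; One-time setup: build "JMP $xx00" in RAM, where xx = >SET.
INIT    LDA #$4C        ; opcode for JMP absolute
        STA THUNK
        LDA #$00        ; placeholder low byte, overwritten on every dispatch
        STA THUNK+1
        LDA #>SET       ; high byte of the handler page, fixed from here on
        STA THUNK+2
        RTS

; Per dispatch, with X already holding the table index:
DISP    LDA OPTBL,X     ; true low byte of the handler address
        STA THUNK+1     ; 3 cycles if THUNK is in zero page, 4 otherwise
        JMP THUNK       ; 3 cycles here, plus 3 for the JMP inside the thunk
```

If the dispatch code itself is in RAM, the same store could instead target the operand byte of a JMP placed directly after it, which would also drop the 3-cycle JMP into the thunk.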