i've been thinking about a closely related feature in a different context: adding block arguments, as in smalltalk or ruby or especially lobster, to a language more like c, with static types and stack allocation<p>i think this would be a good fit for (among other things) clu-like iterators and imgui libraries, where you often want to do something like<p><pre><code> submenu("&Edit") {
command("&Cut") { clip_cut(getSelection()); }
...
}
</code></pre>
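for contrast, here is a minimal c sketch (invented api, purely illustrative) of how this pattern has to be written without block arguments: each block becomes a named callback, and its state travels through a context pointer<p><pre><code> #include &lt;stdio.h&gt;

 /* invented imgui-style api, purely for illustration: without block
    arguments, each block is hoisted into a named function and its state
    has to travel through a void* context parameter */
 typedef void body_fn(void *ctx);

 static void submenu(const char *label, body_fn *body, void *ctx) {
     printf("begin submenu %s\n", label);  /* stand-in for real ui work */
     body(ctx);                            /* closure-style indirect call */
     printf("end submenu %s\n", label);    /* cleanup after the body returns */
 }

 static void command(const char *label, body_fn *action, void *ctx) {
     printf("  command %s\n", label);      /* a real library would run action on click */
     (void)action; (void)ctx;
 }

 static void cut_action(void *ctx) { (void)ctx; /* clip_cut(getSelection()); */ }

 static void edit_menu_body(void *ctx) {
     command("&Cut", cut_action, ctx);
     /* ... */
 }

 int main(void) {
     submenu("&Edit", edit_menu_body, NULL);
     return 0;
 }
</code></pre>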
this is especially useful in a context where you're heap-allocating sparingly or not at all, because the subroutine taking the block argument can stack-allocate some resource, pass it to the block, and deallocate it once the block returns; python context managers and win32 paint messages are two cases where people commonly do this sort of thing, but things like save-excursion, with-output-file, transactional memory, and gsave/grestore also provide motivation<p>the conventional way to do this is to package the block up into a closure and then invoke it with a full-fledged function call, using a calling convention that supports closures. but i suspect a more relaxed and efficient approach is an asymmetric coroutine calling convention, in which the callee yields control back to its caller at the entry point of the block, and the block then resumes the callee when it finishes. so instead of merely dividing registers into callee-saved and call-clobbered, as subroutine calling conventions do, we would divide them into three classes: registers that are callee-saved on return but that, at a yield, hold callee values the block must restore before resuming; caller coroutine context registers, which are preserved across both return and yield; and call-clobbered registers. in many cases you also need a way for the block to safely force an early exit from the callee<p>this allows the caller's local variables to be in registers its blocks can use without further ado, or at least indexed off of such a register, while allowing the yield and resume operations to be, in many cases, each just a single machine instruction. and it does not require heap allocation<p>as an example of taking this to the point of absurdity, here's an untested subroutine for iterating over a nul-terminated string passed in r0 with a block passed in r1, using a hypothetical coroutine convention which passes at least r4 through from its caller to its blocks<p><pre><code> itersz: push {r6, r7, r8, lr}    @ r8 is pushed only to keep the stack 8-byte aligned
         mov r7, r0               @ r7: walking string pointer
         mov r6, r1               @ r6: block address, kept out of call-clobbered r1
 1:      ldrb r0, [r7], #1        @ load the next byte, post-incrementing the pointer
         cbz r0, 1f               @ nul terminator: done
         blx r6                   @ yield the byte in r0 to the block
         b 1b
 1:      pop {r6, r7, r8, pc}     @ restore and return to the original caller
</code></pre>
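for comparison, here is a rough c sketch of the same iteration under the conventional closure-style convention described above; each byte costs a full indirect call, and the visitor's state travels through a context pointer instead of riding along in a pass-through register like r4 (all names invented for illustration)<p><pre><code> #include &lt;stdio.h&gt;
 #include &lt;stddef.h&gt;

 /* conventional-convention equivalent of itersz: visit each byte of a
    nul-terminated string through a callback plus a context pointer */
 typedef void byte_block_fn(void *ctx, unsigned char byte);

 static void itersz_c(const char *s, byte_block_fn *block, void *ctx) {
     while (*s)                            /* the ldrb / cbz pair above */
         block(ctx, (unsigned char)*s++);  /* the blx "yield" */
 }

 /* trivial visitor: count bytes, keeping its state behind the context
    pointer rather than in a pass-through register like r4 */
 static void count_byte(void *ctx, unsigned char byte) {
     (void)byte;
     ++*(size_t *)ctx;
 }

 int main(void) {
     size_t n = 0;
     itersz_c("hello", count_byte, &n);
     printf("%zu bytes\n", n);             /* prints: 5 bytes */
     return 0;
 }
</code></pre>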
and here is another untested subroutine which uses it to calculate a string hash<p><pre><code> hashsz: push {r4, r5, r9, lr}    @ save the caller's pass-through r4, which we clobber as the hash accumulator
         movs r4, #53             @ seed the hash in the pass-through register r4
         adr r1, 1f               @ address of the block below
         adds r1, #1              @ set the thumb bit so the blx in itersz stays in thumb state
         bl itersz                @ iterate; itersz yields each byte to the block in r0
         mov r0, r4               @ return the accumulated hash
         pop {r4, r5, r9, pc}
 1:      eor r4, r0, r4, ror #27  @ the block: mix the byte into the hash
         bx lr                    @ resume itersz
</code></pre>
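in c, the hash being computed here is roughly the following (my transcription of the assembly above, with rotr32 standing in for the ror #27 operand)<p><pre><code> #include &lt;stdint.h&gt;
 #include &lt;stdio.h&gt;

 /* helper for the "ror #27" operand: rotate a 32-bit word right by k bits */
 static uint32_t rotr32(uint32_t x, unsigned k) {
     return (x >> k) | (x << (32u - k));
 }

 /* transcription of hashsz: seed 53, then for each byte
    h = byte ^ rotr(h, 27), matching "eor r4, r0, r4, ror #27" */
 static uint32_t hashsz_c(const char *s) {
     uint32_t h = 53;
     while (*s)
         h = (unsigned char)*s++ ^ rotr32(h, 27);
     return h;
 }

 int main(void) {
     printf("%08x\n", (unsigned)hashsz_c("hello"));
     return 0;
 }
</code></pre>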
even in this case, where both the iteration and the visitor block are utterly trivial, the runtime overhead per item (compared to putting them in the same subroutine) is evidently extremely modest; my estimate is 7 cycles per byte rather than 4 cycles per byte on in-order hardware with simple branch prediction, so the added cost is on the order of 1 ns per byte on the hardware russ used as his reference. for anything more complex the overhead should be insignificant<p>it's less general than the mechanism russ proposes here (it doesn't solve the celebrated samefringe problem), but it's also an order of magnitude more efficient, because the yield and resume operations are less work than a full-fledged closure call, though still more work than, say, decrementing a register and jumping if nonzero