Exactly what the title sounds like. At least one person has gone very far down this rabbit hole, possibly two people. A joy to read through, thank you for posting.<p>This is the sort of comparison that would be great to do for a compiler runtime implementation. Integer division comes to mind. Hard to justify the time.
Although this isn't about the 6502, this reminds me of how I had to write x86_64 assembly for class, and was trying to use LEA to avoid a multiplication. After I wrote it, I wondered if the compiler did the same thing I did, but it didn't; the compiler's version used a register unnecessarily. Compare:<p><pre><code> ;; Goal: r8 := 10 * r9
;; Mine:
;; 10x = x + 9x
;; 9x = x + 8x
lea r8, [r9 + 8 * r9]
add r8, r9
;; Compiler's code:
;; 10x = 2 * 5x = 5x + 5x
;; 5x = x + 4x
lea rdi, [r9 + 4 * r9]
lea r8, [rdi + rdi]
</code></pre>
At first, I wondered why the compiler introduced a temporary.<p>However, I had just done the exercises from <i>Structure and Interpretation of Computer Programs</i> that ask you to write recursive and iterative routines that multiply integers through successive doubling and halving, taking advantage of the fact that Cx = x + (C - 1)x when C is odd, and Cx = 2 * ((C / 2)x) when C is even (that's not exactly how the book explained it; I don't perfectly remember and I don't feel like looking it up, even though this wasn't that long ago).<p>I realized: the compiler must be using a variant of the Russian peasant method of multiplication (which the book named as the source of the technique) as a form of strength reduction. I was impressed, but I also realized that this makes it suboptimal for minimizing temporary registers. I couldn't think of an algorithm that would produce the decomposition I used, though, since I hadn't really had an algorithm in mind when I wrote that assembly.
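The odd/even recurrence described above can be sketched iteratively in a few lines of Python (a hypothetical illustration of the method, not the book's actual code):

```python
def peasant_mult(a, b):
    """Russian peasant multiplication by doubling and halving.
    Uses b*a = a + (b - 1)*a when b is odd,
    and  b*a = (b // 2) * (2*a) when b is even."""
    result = 0
    while b > 0:
        if b & 1:       # b odd: peel off one copy of a
            result += a
        a <<= 1         # double a
        b >>= 1         # halve b
    return result
```

For a constant multiplier the loop unrolls into exactly the shift-and-add sequence the compiler emitted: 10 = binary 1010, so the adds land on the 2x and 8x doublings.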
I don't miss those early days of programming. It's fun to optimize and it's amazing what you can do with minimal resources, but it took an incredible amount of effort to get a 70s-era computer to do elementary school arithmetic fast enough to be useful. People spent years working on analyses like this, only to find, a scant few years later, that the next chip had it built in as a single instruction.
Just for fun... I think this should be on the efficient frontier of 8 x 8 -> 16 bit multipliers, at an average cost of around 43 cycles and roughly 1330 bytes of memory. Compare 45.5 cycles and 1580 bytes for mult66.a, the fastest in the linked collection.<p><pre><code> lmul0 = $06 ; pointer into square table low
lmul1 = $08 ; pointer into square table high
labs0 = $0a ; pointer into table of abs(i-1)
prod_low = $0c
* = $0200
; Align tables to start of page
abstable
!for i, 0, 255 {
!byte <(abs(i-1))
}
squaretable1_lsb
!for i, 0, 511 {
!byte <((i*i)/4)
}
squaretable1_msb
!for i, 0, 511 {
!byte >((i*i)/4)
}
; multiply
;
; f(x) = x*x/4
;
; return f(X+Y) - f(|Y-X|)
;
; 8 bit x 8bit unsigned multiply, 16 bit result
;
; On Entry:
; A: multiplier
; Y: multiplicand
; On Exit:
; (prod_low, A): product
mult
sta lmul0
sta lmul1
eor #$ff
sta labs0
lda (labs0),Y ; (ldx has no (zp),Y mode)
tax ;X = abs(Y-A)
lda (lmul0),Y
sec
sbc squaretable1_lsb,X
sta prod_low
lda (lmul1),Y
sbc squaretable1_msb,X
rts
;call this once to initialise high bytes of pointers to table
mult_init
lda #>abstable
sta labs0+1
lda #>squaretable1_lsb
; high byte ($03 with the layout above)
sta lmul0+1
lda #>squaretable1_msb
; high byte ($05 with the layout above)
sta lmul1+1
rts</code></pre>
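The identity the routine leans on, f(x) = floor(x*x/4) with a*b = f(a+b) - f(|a-b|), is easy to check with a small Python sketch (mirroring the 511-entry square-table layout, not the 6502 code itself):

```python
# Table of f(i) = floor(i*i / 4) for i in 0..510, covering the maximum
# index a + b = 255 + 255, as in the routine's square tables.
f = [i * i // 4 for i in range(511)]

def mult8(a, b):
    """Unsigned 8-bit multiply via table of squares: a*b = f(a+b) - f(|a-b|).
    Works because (a+b)^2 - (a-b)^2 = 4ab, and a+b and |a-b| have the same
    parity, so both floor divisions by 4 drop the same fraction."""
    return f[a + b] - f[abs(a - b)]
```

Exhaustively checking all 65,536 input pairs confirms the identity is exact, not an approximation.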
The customization section is interesting, though it's missing things such as multiply & conquer, table wrapping, and the mask & jump techniques; but then, you aren't going to find every custom technique covered in any one text. Sine, cosine, and atan usually used dual interleaved tables that wrapped, and for distance, you could use a short table and a long table that used Manhattan distance to do the lookup, depending on whether X > Y or Y > X.<p>Edit: I see a link on that page to another GitHub repo with sqrt implementations. I suspect that will make for interesting reading too.
Yet again I am struck by how incredible it is that Braben and Bell, two undergraduate students, could produce something as ground-breaking as Elite—all the way down to writing their own multiplication routines.
My Apple II "Mode 7" demo relies heavily on a "table of squares" multiply algorithm, though it's signed (vs unsigned), which is even more of a pain on the 6502.
<a href="https://www.youtube.com/watch?v=o8427JsfGwk" rel="nofollow noreferrer">https://www.youtube.com/watch?v=o8427JsfGwk</a><p>So many 16-bit demo effects assume you have a decently fast multiply or divide instruction, which makes it really hard to do the same effects on the 6502 (as I found with the Second Reality remake).
He is missing the state-of-the-art result for 16 x 16 -> 32 bit. That would be Repose in this CSDB thread : <a href="https://csdb.dk/forums/?roomid=11&topicid=91766" rel="nofollow noreferrer">https://csdb.dk/forums/?roomid=11&topicid=91766</a> at 190 cycles (depending on how exactly you count).
I’m writing a pinball game in assembly for the original Game Boy, and recently had to hand-craft multiply, division, and cosine routines to trade off accuracy for spare cycles. Something like this would have come in handy!
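A common way to make that accuracy-for-cycles trade on such hardware is a precomputed fixed-point lookup table. A minimal Python sketch of generating one (the 256-step angle and signed 1.7-style fixed point are illustrative assumptions, not the actual Game Boy routine):

```python
import math

# Cosine lookup table: 256 angle steps per full turn, values scaled to
# -127..127 (roughly signed 1.7 fixed point, so 127 represents ~1.0).
COS_TABLE = [round(math.cos(2 * math.pi * i / 256) * 127) for i in range(256)]

def cos_fixed(angle):
    """angle: integer, 256 units per full turn; returns scaled cosine."""
    return COS_TABLE[angle & 0xFF]  # mask wraps the angle for free
```

On the real hardware the table would be assembled into ROM and indexed directly, so the per-call cost is just the masked load; the only accuracy lost is the 8-bit quantization of both angle and value.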
The graphical error plots for the approximating versions are a joy to behold. You can clearly see the ringing, and how the errors tend to grow with larger inputs.