I've had to implement BigNum on an embedded system that had very little RAM, where the initial <i>optimized</i> version of ModExp took many minutes to complete. After much hair-pulling, the final version took 40 seconds.<p>First, you should work with unsigned numbers, and use a power of 2 as your word size. The fastest choice of word size is very operation- and CPU-dependent.<p>A key trick is to lay out the words of your bignum in memory in the same order as the CPU's byte order within each word, e.g. least-significant word first on little-endian systems. This lets you choose your word size dynamically for each operation: in memory, a number with M words of N bits each is identical to a number with M / 2 words of N * 2 bits each.<p>For multiplication, identify the CPU instruction with the widest result, then use half that size as your word size. Each step through the arrays generates a word result in the low half and a word carry in the top half. The carry gets added to the result of the next step, possibly overflowing.<p>For addition, use the widest result as your word size. This can also overflow.<p>How you deal with overflows is very CPU-dependent. You can use adc/addc as someone else mentioned, which will be faster on embedded and <i>may</i> be faster on fatter chips. Alternatively, you can halve the word size and use the top half as the carry.<p>If addc is not available, you can test for overflow as follows:<p><pre><code> uint32_t a = ..., b = ...;
uint32_t res = a + b;
uint32_t carry = res < a;
</code></pre>
On overflow, res is necessarily less than both a and b, so there's no need to check b as well.<p>If SIMD instructions are available, they will almost always be the fastest choice by far. They don't change the above guidelines in principle, but they often provide, e.g., optimized carry and overflow mechanisms of their own.
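To turn that overflow test into a full multi-word addition, here's a minimal sketch (the function name and layout are illustrative, not from the comment above), assuming 32-bit words stored least-significant first:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two n-word bignums, least-significant word first; returns the
   final carry. Uses the portable `res < a` trick to detect wraparound. */
uint32_t bignum_add(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n) {
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t sum = a[i] + carry;
        uint32_t c1  = sum < carry;   /* a[i] + carry wrapped around */
        uint32_t res = sum + b[i];
        uint32_t c2  = res < sum;     /* sum + b[i] wrapped around */
        dst[i] = res;
        carry = c1 | c2;              /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

Note that at most one of the two partial additions can overflow in a given step, so the carry out of each word is always 0 or 1.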
Coincidentally, I was writing a bignum library from scratch just two weeks ago.<p>A few interesting things I learned:<p>1. Knuth volume 2 has an extensive discussion of the problem space. I've only gotten a chance to skim it so far, but it looks interesting and approachable.<p>2. I need to support bitwise operations, which operate on a two's complement representation, so I figured it would be simpler to use two's complement internally, despite seeing that most (all?) bignum libraries use signed magnitude. I'm starting to regret this: two's complement introduces a lot of complexity.<p>The most fun thing about working on a bignum library is that it makes the algorithms you learned in grade school for add/subtract/multiply/divide relevant again. The basic ("classical") bignum algorithms are exactly what you learned in grade school, just in a much larger base than 10.
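For example, grade-school long multiplication carries over directly to base 2^32, with each digit-by-digit product captured in a 64-bit intermediate. A hypothetical sketch (names are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Classical (schoolbook) multiplication in base 2^32: a has na words,
   b has nb words, dst receives na + nb words, all least-significant
   word first. Exactly the grade-school algorithm, with 32-bit digits. */
void bignum_mul(uint32_t *dst, const uint32_t *a, size_t na,
                const uint32_t *b, size_t nb) {
    for (size_t i = 0; i < na + nb; i++) dst[i] = 0;
    for (size_t i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            /* 32x32 -> 64 product, plus the digit already there, plus
               the running carry; the total cannot overflow 64 bits. */
            uint64_t t = (uint64_t)a[i] * b[j] + dst[i + j] + carry;
            dst[i + j] = (uint32_t)t;   /* low half is the new digit */
            carry = t >> 32;            /* high half carries over */
        }
        dst[i + nb] = (uint32_t)carry;
    }
}
```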
For maximum efficiency, you should work in binary instead of base 10. Handling carries becomes more straightforward with the right primitives, for example __builtin_addc with GCC: <a href="https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html" rel="nofollow">https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...</a><p>You can also implement it in C if you want a more portable solution: <a href="https://github.com/983/bigint/blob/ee0834c65a27d18fa628e6c526a0b83b24db90f9/bigint.c#L27">https://github.com/983/bigint/blob/ee0834c65a27d18fa628e6c52...</a><p>If you scroll around, you can also find my implementations for multiplication and such.
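As a sketch of what the carry chain looks like with these primitives, here's a hedged example using __builtin_add_overflow from the same manual page (available since GCC 5; __builtin_addc itself needs Clang or GCC 14+). The function name is illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multi-word add using the GCC/Clang overflow builtins. For unsigned
   operands, __builtin_add_overflow's "did it wrap" result is exactly
   the carry-out bit. */
uint32_t bignum_add_builtin(uint32_t *dst, const uint32_t *a,
                            const uint32_t *b, size_t n) {
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t t, r;
        uint32_t c1 = __builtin_add_overflow(a[i], carry, &t);
        uint32_t c2 = __builtin_add_overflow(t, b[i], &r);
        dst[i] = r;
        carry = c1 | c2;   /* at most one can be set per step */
    }
    return carry;
}
```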
Funny, I did this myself 10 years ago. Shit, it's been that long..?<p>For my undergrad project I wrote a computer algebra system to do symbolic integration. The supervisor was a hardcore, old school C guy, so naturally I was going to just use C and no libraries. He told me I'd need bignums first, so I got to work (this is because many algorithms like polynomial GCD create massive numbers as they go, even if the inputs and final outputs are generally very small).<p>I just couldn't figure out how to do better than the largest power of 10 per digit at the time. Working with non base 10 arithmetic was a mind fuck for me at the time. So I did it with digits holding 10^9 and the classical algorithms from Knuth. Division is the hardest!<p>At some point I discovered the GNU multiple precision library (GMP) and made my program work with that instead of mine. I was shocked at how much faster GMP was! I finished my project with my own code, but I knew I had to come back to do it better.<p>The breakthrough came when I got a copy of <i>Hacker's Delight</i>. It has stuff like how to detect overflow after it's happened (in C). Something twigged and then I just understood how to fill each word completely rather than use a power of 10. I don't know what confused me before.<p>But, of course, the real way to do it is to use assembly. You can't get close to high performance in C alone. In assembly you get the overflow bit. It's actually easier in a way! So you write tiny platform specific bits for the digits and build on that in C. My add and subtract were then as fast as GMP. I lost interest when it came to implement faster multiplication algorithms.<p>Code in case anyone is interested: <a href="https://github.com/georgek/bignums">https://github.com/georgek/bignums</a>
Modulo is surprisingly expensive even when it's computed together with the quotient in a single division. It is almost always better to use binary "limbs", in this case 31 or 32 bits wide, because decimal parsing and printing should be much rarer than individual arithmetic operations in general.
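To illustrate where that tradeoff bites, here's a hypothetical sketch of the one place division shows up with binary limbs: peeling off decimal digits for printing. Everything else stays in fast base-2^32 arithmetic (the function name is illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Divide a base-2^32 bignum (least-significant word first) in place by
   a small divisor, returning the remainder. Repeated calls with
   d = 1000000000 peel off nine decimal digits at a time when printing. */
uint32_t bignum_divmod_small(uint32_t *a, size_t n, uint32_t d) {
    uint64_t rem = 0;
    for (size_t i = n; i-- > 0; ) {         /* most-significant word first */
        uint64_t cur = (rem << 32) | a[i];  /* two-word partial dividend */
        a[i] = (uint32_t)(cur / d);
        rem  = cur % d;
    }
    return (uint32_t)rem;
}
```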
For a small survey of practical efficient methods for bignum arithmetic operations, the Algorithms section of the documentation of GMP [1] is excellent.<p>[1]: <a href="https://gmplib.org/manual/Algorithms" rel="nofollow">https://gmplib.org/manual/Algorithms</a>
I think your description is almost excellent, but you're being fundamentally misleading in describing what you are doing as a "30-bit" digit.<p>Mathematically it's a base-10^9 digit, occupying 30 bits of storage. You do briefly mention the 10^9, but you repeatedly say 30 bits.
Fascinating article. I've always wondered how these big number libraries work.<p>As a side question, does anyone know the program the author used to make those "addition" and "multiplication" performance graphs? Thanks.
LOL, I've been there: <a href="https://www.kylheku.com/cgit/txr/commit/mpi?id=98dedd310b1d5d876b0fbb0ebd6c4df9bd7b2d88" rel="nofollow">https://www.kylheku.com/cgit/txr/commit/mpi?id=98dedd310b1d5...</a><p>(Patch was originally from 2011; it was bugfixed once, and then in 2015 converted to git commit:<p><a href="https://www.kylheku.com/cgit/txr/commit/mpi-patches/faster-square-root?id=ef47dfe4fcb7c1be369ae83221386b9da6474a1e" rel="nofollow">https://www.kylheku.com/cgit/txr/commit/mpi-patches/faster-s...</a> )<p>Another one: faster highest-bit search:<p><a href="https://www.kylheku.com/cgit/txr/tree/mpi-patches/bit-search-optimizations?id=124e7dd6977a0853d7a8399921e31fd1ccde2dcb" rel="nofollow">https://www.kylheku.com/cgit/txr/tree/mpi-patches/bit-search...</a><p>That does use GCC built-ins today: <a href="https://www.kylheku.com/cgit/txr/commit/?id=15b7c542dc44899e8db7addfcc2f1c1c4a188b49" rel="nofollow">https://www.kylheku.com/cgit/txr/commit/?id=15b7c542dc44899e...</a>
I am surprised that there's no exploration of Karatsuba's algorithm. That's what makes the Python implementation fast.<p>I actually came here hoping to find discussion of Karatsuba.
<a href="https://en.m.wikipedia.org/wiki/Karatsuba_algorithm" rel="nofollow">https://en.m.wikipedia.org/wiki/Karatsuba_algorithm</a>
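To get a feel for the identity, here's a minimal sketch applying Karatsuba to a single 32x32 -> 64 multiply by splitting each operand into 16-bit halves. Real bignum implementations apply the same recurrence to half-length limb arrays; this just shows the three-multiplications trick in isolation:

```c
#include <assert.h>
#include <stdint.h>

/* Karatsuba: compute a*b with three multiplications instead of four.
   With 16-bit halves, all intermediates fit comfortably in uint64_t. */
uint64_t karatsuba_mul32(uint32_t a, uint32_t b) {
    uint64_t a1 = a >> 16, a0 = a & 0xFFFF;
    uint64_t b1 = b >> 16, b0 = b & 0xFFFF;
    uint64_t z2 = a1 * b1;                         /* high halves   */
    uint64_t z0 = a0 * b0;                         /* low halves    */
    uint64_t z1 = (a1 + a0) * (b1 + b0) - z2 - z0; /* cross term    */
    return (z2 << 32) + (z1 << 16) + z0;
}
```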
Very useful post! Also it's cool to see how many people in this thread have worked on this problem -- lots of new info here I haven't seen<p>I wonder if anyone is interested in implementing big numbers in Oils? It's a Unix shell with TWO complete implementations - the "executable spec" in Python, and an automatic translation to pure C++ (which is 30x-50x faster)<p>We currently use 64-bit integers in C++, but bignums have better semantics. Some trivia about bad shell semantics here:<p><i>Integers - Don't do whatever Python or C++ does</i> - <a href="https://www.oilshell.org/blog/2024/03/release-0.21.0.html#integers-dont-do-whatever-python-or-c-does" rel="nofollow">https://www.oilshell.org/blog/2024/03/release-0.21.0.html#in...</a><p>This is a very self-contained project: the interface is defined by 200 lines of Python:<p><a href="https://github.com/oilshell/oil/blob/master/mycpp/mops.py">https://github.com/oilshell/oil/blob/master/mycpp/mops.py</a><p>and the trivial 64-bit overflowing implementation is also about 200 lines:<p><a href="https://github.com/oilshell/oil/blob/master/mycpp/gc_mops.h">https://github.com/oilshell/oil/blob/master/mycpp/gc_mops.h</a><p><a href="https://github.com/oilshell/oil/blob/master/mycpp/gc_mops.cc">https://github.com/oilshell/oil/blob/master/mycpp/gc_mops.cc</a><p>(We have a fast Ninja-based build system, so you can probably iterate on this in 100 milliseconds or less -- it should be fun for the right person)<p>---<p>I think the main reason it is specific to Oils is that the big numbers should become GC objects.
Details on our GC here:<p><i>Pictures of a Working Garbage Collector</i> - <a href="https://www.oilshell.org/blog/2023/01/garbage-collector.html" rel="nofollow">https://www.oilshell.org/blog/2023/01/garbage-collector.html</a><p>It's been very solid for the last 18 months, basically because it's well tested by ASAN and #ifdef testing modes.<p>The main thing I'd be concerned with is how to TEST that big number operations are correct. I think there are probably some interesting strategies, which I'd love to discuss.<p>You're of course welcome to adapt existing open source code, including code you've already written -- I probably even prefer that, i.e. something that has had some real world testing. We want all the operations in that file, and it should be integrated with our GC.<p>---<p>We've had ~6 contributors funded by grants from <a href="https://nlnet.nl" rel="nofollow">https://nlnet.nl</a> for the past couple years, so you can even be paid (there's a chance it depends on the country you live in, but generally a wide range of situations is OK).<p>Contact me at andy at oilshell.org or <a href="https://oilshell.zulipchat.com/" rel="nofollow">https://oilshell.zulipchat.com/</a> if interested!
Speaking of bignum libraries, I recently watched a talk with Rob Pike where he mentioned that one thing he regretted about Go was not making the default integer implementation arbitrary precision. Supposedly the performance overhead for normal numbers is very small, and you avoid the weirdness and complicated semantics of fixed precision integers. I found that to be quite fascinating, especially coming from a "low-level guy" like Rob Pike. Ever since I've been wanting a language with that feature and to understand how bignum implementations work.
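One common way that overhead is kept small is pointer tagging: small integers live inline in a tagged machine word, and a heap-allocated bignum is used only when a value no longer fits. This is a hedged sketch of the general technique, not Go's or any particular runtime's actual representation; all names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t value;   /* tagged word: low bit 1 => inline small int */

static int      is_small(value v) { return v & 1; }
/* Arithmetic right shift of negatives is implementation-defined in C,
   but behaves as expected on mainstream compilers. */
static intptr_t untag(value v)    { return (intptr_t)v >> 1; }
static value    tag(intptr_t i)   { return ((uintptr_t)i << 1) | 1; }

/* Fast path: if both operands are small and the sum still fits in the
   tagged range, no allocation and no bignum code runs at all. */
int try_small_add(value a, value b, value *out) {
    if (is_small(a) && is_small(b)) {
        /* Each operand fits in half the intptr_t range, so this sum
           cannot overflow intptr_t itself. */
        intptr_t r = untag(a) + untag(b);
        if (r >= INTPTR_MIN / 2 && r <= INTPTR_MAX / 2) {
            *out = tag(r);
            return 1;
        }
    }
    return 0;   /* caller falls back to the heap-allocated bignum path */
}
```

The slow path only triggers on overflow or when an operand is already a heap bignum, which is why typical integer-heavy code pays almost nothing for the arbitrary-precision default.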