(Last updated: July 9, 2018)
Large integer multiplication is perhaps the most important part of any bignum or Pi program that intends to reach millions of digits. Nearly every non-trivial operation (division, square root, etc...) reduces to large integer multiplication.
Because of its importance, large multiplication is the most heavily optimized component in y-cruncher. Its interface along with all the supporting algorithms accounts for approximately 2/3 of the over 400,000 lines of code in y-cruncher.
As of v0.7.6, y-cruncher has the following set of algorithms and implementations:
|Algorithm||Code||First Appearance||Parallelized||Swap Mode||MRFM||Notes|
|Basecase Multiplication||BKT||v0.1.0||Janurary 2009||No||No||No|
Toom-3 is no longer used as of v0.6.1.
|32-bit Small Primes NTT||N32||v0.6.9||December 2015||Yes||No||No|
|64-bit Small Primes NTT||N64||v0.6.8||March 2015||Yes||Yes||Yes|
|Hybrid NTT||HNT||v0.1.0||Janurary 2009||Yes||Yes||No|
|VST Algorithm||VST||v0.6.1||February 2013||Yes||Yes||Yes||
|Code 17 Experiment||C17||v0.7.1||May 2016||Yes||Yes||Yes||
This set of algorithms has been accumulating since the start of the y-cruncher project. No processor architecture uses every single algorithm in this table. But every algorithm except for Toom-3 was useful for at least one processor line at some point in the history of the y-cruncher project.
The 3-letter codes are the ones that appear in the stress-tester and the swap mode status printing.
Internally, these algorithms are managed through a virtual method interface. This interface defines all the functions that a "multiplication algorithm" needs to have. Then each algorithm implements it. Instances of these objects are thrown into a global algorithm table with an activation threshold.
When a computation requests a large multiply, the framework walks the global table to select an algorithm object. Then it uses it to perform the large multiply. The overhead of this entire process is made negligible by short-circuiting small multiplications directly to the basecase algorithm.
This global algorithm table is customized on a per-processor basis. And there are separate tables for each of the computing modes since not all of the algorithms have implementations for the out-of-core computation modes (Swap Mode and MRFM).
Historically, the largest multiplication that y-cruncher has ever done have been part of the Pi world record computations. Each of these correspond to the "Final Multiply" step in the respective Pi computation. This "Final Multiply" is the only full-sized multiplication in the entire Pi computation.
|Size (N x N decimal digits)||Date Finished||Who||y-cruncher Version||Algorithm||Time|
|22.4 trillion||October 30, 2016||Peter Trueb||v0.7.1||VST Algorithm||72 hours|
|13.3 trillion||September 12, 2014||"houkouonchi"||v0.6.3||VST Algorithm||89 hours|
|12.1 trillion||December 14, 2013||Shigeru Kondo||v0.6.2||VST Algorithm||32 hours|
|10 trillion||September ??, 2011||Shigeru Kondo||v0.5.5||Hybrid NTT||86 hours|
|5 trillion||July 22, 2010||Shigeru Kondo||v0.5.4||Hybrid NTT||42 hours|
Going further back, there were many localized tests of large multiplications including the squaring of a 5.2 trillion digit integer in about 40 hours. But none of these tests were ever properly recorded - let alone in detail. The only certain detail is that they were all done using the Hybrid NTT since that was the only large-sized multiplication algorithm that y-cruncher had prior to 2012.
The largest multiplication that y-cruncher can theoretically do is (6.7 * 1017) x (6.7 * 1017) decimal digits. This is achieved with the 64-bit NTT using 5 primes and a transform length of 7 * 252 which provides a convolution length of 4.03 * 1018 bits. However, such an operation will require over 2 exabytes of memory which is near the limit of 64-bit addressing and is therefore beyond the reach of current hardware as of 2015.
The Basecase and Karatsuba algorithms handle the smallest products. Their implementations in y-cruncher are fairly standard, uninteresting, and inefficient. In the past, you needed to use assembly to make these algorithms efficient. But things are better now with compiler intrinsic support for both 64-bit multiply and add-with-carry.
Currently, the basic multiplication algorithms are not performance bottlenecks. Most of the run-time is spent in the FFT-based algorithms for large products. For this reason, very little effort has been made to optimize these basic algorithms.
In the past, y-cruncher also had an implementation of 3-way Toom-Cook. But it was never optimal to use on any processor and was removed after v0.5.5. Likewise, no attempt was ever made to implement higher or unbalanced Toom-Cook algorithms. Floating-Point FFT is so fast that it renders all of these more complicated methods completely obsolete. If current hardware trends continue, even Karatsuba multiplication will eventually disappear from future processors.
The Floating-Point FFT handles medium sized products from a few hundred digits up to what fits in CPU cache.
y-cruncher's implementation is a complex-to-complex FFT using right-angle convolution. It uses radix-4 and split-radix butterfly reductions and supports transform lengths of 2k, 3*2k, and 5*2k. The implementation is optimized for pure speed by means of SIMD and Fused-Multiply Add (if available).
As of 2015, the Floating-Point FFT remains the fastest multiplication algorithm in pure computational speed. It blows away every other algorithm, Karatsuba, high-order Toom-Cook, NTT... you name it. Nothing else even comes close. And the FFT will continue to get faster as it can take advantage of the ever widening SIMD.
As awesome as the FFT is, it has two major drawbacks: Memory consumption and round-off error. For large products, the FFT consumes too much memory and memory bandwidth. The memory cost is actually super-linear since the number of bits that can be placed in each point decreases for larger FFTs. And even when memory quantity isn't a problem, the pressure on memory bandwidth will kill off any parallel scalability.
Round-off error is the other issue and is the reason why most standard libraries avoid the algorithm completely. The FFT uses floating-point arithmetic where the results need to be correctly rounded to the integer. However, it is difficult to prove that the results will always be correctly rounded.
The most common alternative to the Floating-Point FFT is the Number-Theoretic Transform (NTT). NTTs are about an order of magnitude slower than FFTs, but they require only a fraction of the memory cost. So while FFTs are bandwidth constrained, NTTs tend to be compute-bound - which makes them good candidates for parallelism and out-of-core computation.
The classic approach to the NTT is to perform several NTTs over different prime numbers. The results are then recombined using the Chinese Remainder Theorem.
y-cruncher uses the classic approach with Victor Shoup's butterfly algorithm which requires no divisions.
y-cruncher's has two implementations of the NTT:
The 32-bit NTT is faster, but it has a size limit of about a billion bits. The 64-bit NTT is slower, but it has virtually no size limit.
The two implementations are identical in virtually every aspect other than the word-size. Both can use anywhere from 3 to 9 primes. And both support transform sizes of 2k, 3*2k, 5*2k, and 7*2k. This similarity is by design since they share much of the same code by means of template metaprogramming.
An early prototype of the 64-bit NTT can be found on my GitHub.
y-cruncher's NTTs are not optimized to be as fast as possible. Some sacrifices are made to reduce complexity and improve memory efficiency. For example, the NTT can be made faster by using 30 or 62-bit primes. But in the context of y-cruncher, this approach leads to unfortunate side-effects which are difficult to deal with.
The remaining three algorithms are unique to the y-cruncher project:
The HNT and VST were developed during my college days. The Hybrid NTT came in 2008 in an attempt to find an algorithm that would serve as a middle-ground in the horrific space-time tradeoff between the FFTs and the NTTs. The success of the Hybrid NTT is what ultimately led to y-cruncher itself.
The VST algorithm was a stupid experiment with SIMD in 2011 that eventually grew into a half-decent disk-swapping algorithm for swap mode computations.
The C17 algorithm is a nameless algorithm that was developed early in 2016 and added to y-cruncher v0.7.1.
As of this writing, all three algorithms remain unpublished and proprietary to the y-cruncher project.
For all practical purposes, these "novelty" algorithms are less important today than back in 2009 when y-cruncher first launched. Nowadays, these algorithms are becoming increasingly memory bound. So the quality of implementation with respect to memory usage matters more than the underlying math.
The use of floating-point FFTs in bignum arithmetic has always been a source of contention.
The last point is important. The very nature of floating-point algorithms is that they have round-off error. Using them to perform an exact operation requires that round-off error be kept at an acceptable level. In practice, the radix (# of bits per point) is chosen to be large as possible based on what is empirically found to be limit before round-off error. But the provable limits on round-off are much more pessimistic.
Implementations that use provable bounds instead of empirical bounds require approximately double the transform length. (2x computation, 2x memory)
This is a lot of performance to be throwing away. Therefore, many programs (PiFast, TachusPi, y-cruncher) decide to "take the risk" and use the empirical bounds.
That said, there are varying levels for which corners can be cut:
y-cruncher is in the middle category. All the settings are extensively tested to work on carefully chosen worst-case inputs. Then an extra safety margin is added.
In other words, there are no known inputs that would fail. But nevertheless, it is still not provably correct.
On the other hand, Prime95 from the GIMPS project would be in the last category. It relies on balanced representation and destructive cancellation to reduce the sizes of the coefficients - thereby allowing more bits to be crammed into each point. While this works for truly "random" inputs such as those in Mersenne prime testing, it is also easy to artificially construct inputs that would break the algorithm.
Unfortunately, the numbers encountered in y-cruncher aren't always "completely random". Some operations such as Newton's Method can produce values with many consecutive 1 bits - which also happens to be a worst-case input for some algorithms.
Theoretical vs. Practical: Mathematics vs. Engineering
From a mathematical perspective, use of anything other than the provable bounds is absolutely prohibited since the algorithm could be incorrect. But from an engineering perspective, other factors come to play.
The goal is to multiply large numbers correctly. But there a number of things that can affect the correctness of a large multiply:
Even with a correct algorithm, it is not possible to multiply large numbers with 100% reliability. Upon one analyzing these failure cases, it becomes obvious that an imperfect algorithm can be used without affecting reliability if other sources of error are much larger.
Even if you assume bug-free software (highly unrealistic), you can't do better than 1 error in 1017 due to the hardware. Using empirical bounds for FFTs adds a probability of error that is basically negligible compared to the 1 in 1017 rate of hardware errors.
Regardless of what the source of error is (software/hardware/algorithmic), redundancy checks can be used at minimal cost to catch errors.
Applying a modular identity over a 64-bit prime puts the probability of failing to detect an error at about 1 in 264. Ultimately, the scenario we want to avoid is an undetected error. Assuming bug-free code, the probability of an undetected error is 1 in 264 * 1017 ~ 1036. That's not likely to happen. If anything fails, it will be something stupid like a bug in the error-detection logic...
Needless to say, most bignum programs that use the FFT take the engineering approach - with or without the error-detection.
y-cruncher uses the error-detection with modulo 261 - 1 (a Mersenne prime) for reasons of performance.
The floating-point FFT isn't the only algorithm that suffers from this sort of round-off error. All of y-cruncher's proprietary algorithms (HNT, VST, and C17) also use
floating-point in ways that are difficult to prove correct for all inputs. So in that sense, all of these algorithms are useless for libraries that require provable correctness.
For that matter, even the NTTs (which are integer algorithms) have some shady floating-point usage in their CRT construction steps. But these uses are very localized. So while they haven't been rigorously proven to be correct, they have been carefully analyzed and deemed suitable for their purposes.
Large integer multiplication isn't the only place in y-cruncher that uses floating-point for otherwise integer tasks. This sort of stuff is littered all over the entire program. The most brutal example of this is in the Digit Viewer's binary-to-decimal conversion routine. This one has been proven correct for all inputs under the assumption of strict-IEEE floating-point behavior.
Out-of-core Algorithms (aka External memory algorithms) are algorithms that are meant to process data that is too large to fit in memory. In the current era, large number computations of trillion of digits are typically multiple orders of magnitude too large to fit in main memory. Therefore, out-of-core algorithms are critical to performing computations of these sizes.
Currently, the only form of out-of-core computation that y-cruncher supports is Swap Mode which uses disk storage as the global memory space and ram as a cache. This Swap Mode is what has been used for nearly all size world records set by y-cruncher.
Internally, y-cruncher has a newer out-of-core mode called, "Multi-Region Far Memory" or MRFM. This is part of a larger initiative to attack the NUMA-scalability problem.
As of 2017, y-cruncher has production-ready implementations of MRFM large multiplication. But since the rest of the program isn't MRFM-ready, none of it will see the light of day any time soon - if ever. Therefore, the rest of this section will focus on just the Swap Mode even though the methods are largely the same with MRFM.
Swap Mode FFT:
y-cruncher's Swap Mode multiply is just the standard recursive FFT algorithm - but on disk instead of ram. All the data resides on disk. And ram is only used as a cache for the disk. Since computation can only happen in ram, any data that needs to be touched is explicitly transferred from disk to memory and back.
Since disk is very slow, the key to performance is to reduce the amount of data that needs to be transferred to and from disk. Broadly speaking, these are some of the optimizations that are done to reduce disk access:
When all optimizations are put together, it often becomes possible to multiply a number by itself with only 2 read/write passes over the FFT domain data along with 1 read of the input and 1 write of the output.
Of course things aren't quite that simple. The number of passes over disk isn't the only thing that matters. The number of disk seeks is also important. For FFTs that are orders of magnitude larger than memory, the reduction size needed to do such a 2-pass FFT may result in a prohibitively large number of disk seeks. So there are trade-offs that need to be considered. The "Bytes/Seek" parameter in Swap Mode is the knob that tells y-cruncher how to handle this trade-off.
More recently, the CPU/memory-access performance gap has grown so large that some of these methods used by Swap Mode to minimize access to disk are now being used to minimize access to main memory.
Practically speaking, these optimized out-of-core FFT approaches are non-trivial to implement. While the total code size isn't as obnoxious as the SIMD intrinsic code that make up the kernels of the individual multiplication algorithms, there are numerous corner cases (such as carry-out) that need to be properly dealt with. And with multiple algorithms each having multiple modes, this can get very unwieldy from a development and maintenance standpoint.
So rather than having a dozen different out-of-core implementations, there are two generic ones: One for Swap Mode and one for MRFM.
Each of these generic implementations are completely built on top of a set of primitives. So any multiplication algorithm that provides these primitives will get a fully working out-of-core implementation.
Reuse of redundant FFT transforms is an optimization that seeks to optimize operations that multiply by the same number more than once.
Recall that an FFT-based multiply consists of transforming each of the two operands, pointwise multiplying, and inverse transforming the result.
A * B:
There are 3 transforms. (2 forward, 1 inverse) Now if we wanted to do two multiplies...
A * B, A * C:
That's 5 transforms. (3 forward, 2 inverse) We have saved 1 transform. Assuming an algorithm with trivial pointwise convolution, this is a 16.6% savings.
This particular optimization is applicable to most Binary Splitting recursions as well as Newton's Method iterations and radix conversions.
In practice, the savings is slightly less since inverse transforms tend to be more expensive than forward transforms due to various overheads such as carryout.
Reusing transforms may sound like a decent optimization, but it can be very difficult to implement.
These are pretty big barriers, so few programs actually do it. y-cruncher couldn't do it prior to v0.6.1.
Older versions of y-cruncher (v0.5.5 and earlier) encapsulated multiplication. Because of all the different multiplication algorithms, breaking through this encapsulation to reuse transforms was simply not feasible.
y-cruncher v0.6.1 saw a complete redesign and rewrite of the core algorithms and representation. This time, a transform-only interface was added. All multiplication algorithms were required to implement this interface. For algorithms that didn't use transforms (Basecase/Karatsuba), the interface was faked with some overhead.
The transform-only interface allows high-level mathematical code to reuse redundant transforms without breaking any encapsulation.
Furthermore, it also allows other things that couldn't be done with the old interface: