y-cruncher - Frequently Asked Questions

(Last updated: February 23, 2024)

Back To:

Pi and other Constants:

Where can I download the digits for Pi and other constants that are featured in the various record computations?

As of this writing, Google provides access to the first 31.4 trillion digits of Pi. But this is likely to be temporary.

Historically, there is no reliable place to get the digits. Due to the large sizes of the data, it simply isn't feasible to move them around.

Personally, I have an archive of some of the digits including the first 22.4 trillion digits of Pi. But because I'm on a consumer internet connection with rate and possibly bandwidth limits, it simply isn't possible for me to upload them. When I was still in school, I was able to seed some torrents since university connections are very fast. But that isn't possible anymore.

To answer the question directly, your best bet at getting the digits is to:

Compute them locally using y-cruncher if you have the resources to do so.
Contact the person who ran the computation and see if they still have the digits. (In all seriousness, this probably won't work since everyone seems to just delete the digits after the computation since they take too much space.)

Under extreme circumstances (if you're a famous professor trying to do research), I may make special arrangements to run research code locally on my machine. But this has happened only once and I was dealing with some pretty amazing professors which I didn't want to let down.

It's worth noting that Bittorrent is a viable way to distribute large files like the digits of Pi. And this is something I actually did for about 2 years while I was in school. But I can't really do that anymore because US-based consumer ISPs suck and they will come after you if you use too much bandwidth.

Can you search for XXX in the digits of Pi for me?

In short no. Not because I don't want to, but because I can't. 22.4 trillion digits is a lot of digits. It takes several days just to read all of it. So searching requires too much time and computational resources. I'm not Google and I don't have the ability to search index something that large.

Can you add support for more constants? I want to compute Khinchin, Glaisher, etc...

In the past, the answer was no. But starting from v0.7.7, the custom formula feature will let you build formulas which y-cruncher can evaluate to high precision.

However, the set of functionality in the custom formulas is quite limited. And as a general rule, everything must run in quasi-linear run-time and linear memory. Thus you will not be able to run hard to compute things like Glaisher and Khinchin's Constant as there are no known algorithms for them that meet the above criteria.

I have found an amazing new algorithm for X that's better than the rest!

This is a loaded question that I get more often than I expect. I used to do these in private over email, but there's now a subforum for y-cruncher where these can go. And now, starting from y-cruncher v0.7.7, the custom formula feature will allow some algorithms/formulas to be tested and benchmarked with y-cruncher's full capability.

The computation of Pi and other constants is a pretty well-studied topic that has already been attacked by generations of mathematicians. So rarely are there things that are genuinely new. Likewise, the bar is set very high for anything new to be a "breakthrough" or to make it into y-cruncher.

The majority of the inquiries I get will fall into these categories:

The author is either trolling or has no idea what he/she is talking about.
The algorithm is either just an idea or is incomplete.
The algorithm works, but is not any better than the existing algorithms.

Only once has something gotten all the way through. And it led the new algorithms for Catalan's Constant in v0.7.7.

#1 is relatively rare and I typically just ignore them.

#2 is more common. Depending on the merits of the proposed idea/algorithm, it may lead to a long and interesting discussion. But so far it has yet to lead to anything ground-breaking.

#3 is the most common. Often times people will present an algorithm that works and looks fast on paper. But a deeper analysis reveals that it's either unworkable or is not anywhere near as efficient as predicted.

Here are the common pitfalls that lead to #3:

The algorithm is fast for a small number of digits, but not fast for very high precision. (i.e. Run-time Complexity)

The speed of a series is not determined by just the convergence rate. Just because a series for Pi converges at 100 digits/term doesn't make it faster than Chudnovsky's formula. If it did, then you can trivially combine adjacent terms of Chudnovsky's formula to make it faster.

The series is impressive, but it's difficult or not possible to apply the Binary Splitting algorithm. This is usually the case for non-hypergeometric series or series involving irrational or transcendental numbers.

For irrational x, Sin(x) and Cos(x) are harder to evaluate than Pi itself. So those iterative algorithms involving Sin(x) that converge exponentially to Pi aren't viable. In fact, the fastest known high-precision algorithm for computing Sin(x) and Cos(x) requires the computation of Pi itself!

This doesn't just apply to trig functions. The take away here is to be careful of what functions you need. Many of them are harder to compute than Pi itself.

Program Usage:

What's with the warning, "Unable to acquire the permission, "SeLockMemoryPrivilege". Large pages and page locking may not be possible."?

This warning shows up when y-cruncher doesn't have permission to use large pages.

Large pages is an optional feature that may improve performance especially in the presence of KPTI for Meltdown mitigation. Since it is optional, y-cruncher can run correctly without it. But it will print out a warning if it can't enable it.

Refer to the memory allocation guide on how to turn on large pages. Be aware that even if you make this warning go away, it doesn't mean that large pages are actually working. Details are in the linked guide.

What's the difference between "Total Computation Time" and "Total Time"? Which is relevant for benchmarks?

"Total Computation Time" is the total time required to compute the constant. It does not include the time needed to write the digits to disk nor does it include the time needed to verify the base conversion. "Total Time" is the end-to-end time of the entire computation which includes everything.

The CPU utilization measurements cover the same thing as the "Total Computation Time". It does not include the output or the base convert verify.

For benchmarking, it's better to use the "Total Computation Time". A slow disk that takes a long time to write out the digits will affect neither the computation time nor the CPU utilization measurements. Most other Pi-programs measure time the same way, so y-cruncher does the same for better comparability. All the benchmark charts on this website as well as any forum threads run by myself will use the "Total Computation Time".

For world record size computations, we use the "Total Time" since everything is relevant - including down time. If the computation was done in several phases, the run-time that is put in the charts is the difference between the start and end dates.

There's currently no "standard" for extremely long-running computations that are neither benchmarks nor world record sized.

Why does y-cruncher need administrator privileges in Windows to run Swap Mode computations?

Privilege elevation is needed to work-around a security feature that would otherwise hurt performance.

In Swap Mode, y-cruncher creates large files and writes to them non-sequentially. When you create a new file and write to offset X, the OS will zero the file from the start to X. This zeroing is done for security reasons to prevent the program from reading data that has been leftover from files that have been deleted.

The problem is that this zeroing incurs a huge performance hit - especially when these swap files could be terabytes large. The only way to avoid this zeroing is to use the SetFileValidData() function which requires privilege elevation.

Linux doesn't have this problem since the major filesystems (ext4) supports sparse files and extents which keep track of which parts of a while are valid and which are not. Needless to say, this comes performance overhead that is not present on Windows. Unfortunately for Linux, there is no way achieve the same effect as SetFileValidData() even with full root privileges.

As a result, while Swap Mode in Windows requires administrator privileges, it is also more performant than on Linux.

Further Reading: https://learn.microsoft.com/en-us/archive/blogs/bobsql/sql-server-instant-file-initialization-setfilevaliddata-windows-vs-fallocate-linux

Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!

For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done.

For those who prefer academic terminology, y-cruncher has weak scalability, but not strong scalability. For a fixed computation size, is it not possible to sustain a fixed efficiency while increasing the number of processors. But it is possible if you increase the computation size as well.

I've found a bug! How do I report it? Can you fix it?

If you think you've found a bug, please report through one of these channels:

Open a GitHub issue.
On the Mersenneforum subsection. (new!)
Directly to me via email.

This will look a lot more formal than it really is, but I generally categorize bugs into 5 levels: P0 - P5 where lower number indicates higher severity.

(The naming of which was taken from my time at Google.)

Priority	Description
P0	May cause a world record computation of a built-in constant to finish with the wrong digits - undetected. (assuming reliable hardware) Impact: Announcement of a new world record when it really isn't. Hypothetical Examples: Incorrect formulas, or dependent formulas with the same error being used for both computation and verification. Bugs in the radix conversion and digit-output in combination with a failure of error-detection. By design, there are almost no single points of failure that can lead to a P0 even on unreliable hardware. And to date there has never been a single P0 in a publicly released version of y-cruncher.
P1	May cause a large computation of a built-in constant to either finish with the wrong digits or unrecoverably stop after significant progress is made. (assuming reliable hardware) Impact: Significant loss of computational time and/or resources. Real/Hypothetical Examples: Checkpoint restart bugs that prevent a checkpoint from being written properly. Bugs in the algorithm. This has happened 3 times so far: v0.7.6: There was a gap in checkpoint-restart coverge near the end of Euler-Mascheroni Constant computations. v0.6.3: A checkpoint restart bug almost derailed Shigeru Kondo's 12.1 trillion digit Pi computation. Fortunately, he caught it before any irreparable damage could be done. v0.5.5: There was a bug in the implementation of ArcCoth() that lead to incorrect computation of Log(2), Log(10), and Euler's Constant.
P2	May cause a large computation of a major constant to stop after significant progress is made. Recovery is likely, but will require developer intervention. Impacts: Possible loss of computation after significant time/resource investment. Developer intervention may be significant and require unproven/untested hacks. Real/Hypothetical Examples: Checkpoint restart bugs that prevent a checkpoint from being resumed. Bugs involving the radix conversion and digit output. Bugs that result in significant underestimation of disk requirements. Bugs in the computational plan. Bugs in the large number arithmetic. Significant lapses of error-detection coverage for hardware errors. There have been about 16 of these so far in the history of y-cruncher. The count is fuzzy due to spotty record-keeping early in the project.
P3	Examples: Any bug that may cause any computation to either not finish or finish with the wrong digits. (assuming reliable hardware) Large unintended performance regressions for large computations. Large unintended increases in memory/disk usage for large computations. Any significant UI or configuration bug that causes loss of or incorrect functionality of an important feature. Most custom formula bugs fall into this category no matter how severe they are. Likewise all bugs that cannot derail a large computation are generally capped to this severity. This includes all the secondary features such as the stress-tester.
P4	Examples: Small unintended performance regressions for large computations. Small unintended increases in memory/disk usage for large computations. UI and configuration bugs that cause minor loss of functionality. Misleading error messages.
P5	Examples: UI and configuration bugs that do not cause any loss of functionality. Improper or unexpected handling of invalid user-input. Confusing and/or useless error messages.

Notes:

P0, P1, and P2 are the "emergency class" bugs. They generally warrant an immediate fix and unscheduled release.
P3's and P4's are usually saved up until there is enough of them. Then they are all fixed at once in a single release/patch.
P5's are only fixed in a patch if it is easy to fix without side-effects. Otherwise, they are either deferred to the next feature release or not fixed at all.
The vast majority of bugs are P3's and P4's.

Hardware and Overclocking:

My computer is completely stable. But I keep getting errors such as, "Redundancy Check Failed: Coefficient is too large."

The notorious "Coefficient is too large" error is a common error that can be caused by many things. A full technical explanation is here.

Because of the nature of error, it can be caused by literally anything. But below are the two most common causes.

The hardware is not "AVX stable" or "AVX512 stable":

If your "stable" overclock is hitting the "Coefficient is too large" error on the standard benchmark sizes, then your overclock is not as stable as you think it is.

This error is most commonly seen on overclocks with an unstable AVX or AVX512 configuration. It first became common on Haswell, but is now seen on nearly all Intel processors - especially those with AVX512.

y-cruncher makes heavy use of AVX and AVX512 instructions if the processor supports them. So in order to successfully run a y-cruncher benchmark, your system needs to be stable with these heavier workloads. The problem is that the vast majority of programs don't use AVX. So many "rock-stable" overclocks are actually not stable when running AVX. To make matters worse, most of the early BIOS versions for Skylake X improperly implemented the AVX/AVX512 offsets - thus causing systems to be unstable even at stock settings!

If you search around overclocking forums, you'll find that there are numerous complaints about Prime95 and other AVX workloads being "unrealistic". And for that reason, many overclockers will skip these stress-tests. While this allows for significantly higher overclocks, it sacrifices stability for AVX-optimized applications. So it's common for overclocked systems to be perfectly stable for months, and then immediately crash and burn when attempting to run Prime95 or y-cruncher.

If you fall into this category, lower your overclock. Recent processors (since Kaby Lake) have AVX and AVX512 "offsets" which will automatically downclock the CPU under those respective workloads. Use them! They will let you keep your very high non-AVX overclock without sacrificing stability for AVX and AVX512 workloads.

While y-cruncher isn't quite as stressful as latest Prime95, the workload is very similar. So if you cannot pass Prime95 small FFTs (with AVX) for at least a few seconds, you stand no chance of running any large benchmark or computation with y-cruncher.

AMD processors (up through Zen+) don't have this problem since they don't have native AVX hardware anyway. This may change with Zen 2.

Memory Instability:

The other common cause is memory instability. y-cruncher has a reputation of being notoriously stressful on the memory subsystem. If you read around overclocking forums, there are countless reports of y-cruncher being able to uncover subtle memory instabilities where all other applications and stress-tests will pass.

This doesn't just happen with overclocking. It has been observed on server hardware as well. To date, I've had 4 bug reports on server hardware:

1 was due to a loose memory module.
2 were fixed with a BIOS update or rollback.
1 is still under investigation.

In all 4 of these cases, the users reported that y-cruncher is the only application that managed to fail. Personally, I've had one error on server hardware believed to be caused by an overheating northbridge.

Naturally, one would do a double take since server hardware has ECC memory. But ECC only protects against bit flips in the memory. It doesn't guard against instability in the memory controller or general memory incompatibilities.

It's worth noting that y-cruncher was never designed to be a memory stress-test. And to date, it is still unclear why it often seems to be better at testing memory than even dedicated tests such as MemTest64.

Disk I/O Instability:

In swap mode computations, disk I/O instability is a major cause of errors. Most storage errors are "visible" (such as CRC errors, unreadable sector, drive dismounted, etc...). But a sizable portion of I/O errors are silent and only get caught by y-cruncher's redundency checks.

Silent disk I/O errors tend to be more common on old and aging hard drives. But overall, there isn't enough data to point out any specific causes. Swap mode isn't a heavily used feature. And those that use it tend to have more stable systems. (The overclocking community doesn't really use swap mode.)

~~Why is y-cruncher so much slower on AMD processors than Intel processors?~~

A lot has changed since this was originally written. The situation has completely flipped in 2022 with AMD now beating Intel.

Therefore, this section is very much out-of-date.

Original Answer (circa 2010-2022):

AMD has historically lacked behind in Intel for y-cruncher - sometimes by very large margins. As of Zen 2, AMD reached parity with Intel for desktop chips but is still behind against Intel's server products. What's going on here? Is y-cruncher another one of those "Intel-biased" benchmarks?

Short Answer:

It boils down to the raw SIMD throughput. Intel processors have 256-bit and 512-bit wide vector units while AMD processors are only 128-bits wide.

Intel processors can natively run 256-bit and 512-bit AVX instructions.
Older AMD processors need to split 256-bit instructions into a pair of 128-bit ops and cannot run 512-bit AVX instructions at all.
New AMD processors (Zen 2) can now run 256-bit instructions natively. But they still lack 512-bit support.

y-cruncher is one of the few applications that can utilize the AVX to get the full benefit of the vector units. This gives Intel processors an advantage.

Long Answer:

Here's a data dump of the performance throughput of various vector operations. Higher is better.

Pink entries mark data that is not yet known. Data in them are merely educated guesses.

Hardware

Throughput Per Cycle

Processor

Year

Unit

ISA

Vector Width

Vector Units

Integer

Floating-Point (Double Precision)

Shuffle

Logic

Add

Mul

Add

Mul

Add or Mul

FMA

Intel Sandy Bridge

2011

1 core

AVX

256 bit

3 x 128

2 x 128

1 x 128

1 x 256

2 x 256

2 x 128

1 x 256

Intel Ivy Bridge

2012

Intel Haswell

2013

1 core

AVX2

256 bit

3 x 256

2 x 256

1 x 256

2 x 256

1 x 256

Intel Broadwell

2014

Intel Skylake

2015

1 core

AVX2

256 bit

3 x 256

2 x 256*

1 x 256

Intel Kaby Lake

2017

Intel Coffee Lake

2017

Intel Skylake Purley

2017

1 core

AVX2

256 bit

AVX512

512 bit

2 (1 x FMA)

2 x 512

1 x 512*

1 x 512

2 (2 x FMA)

2 x 512*

Intel Cannon Lake

2018

1 core

AVX2

256 bit

3 x 256

2 x 256*

1 x 256

AVX512

512 bit

2 (1 x FMA)

2 x 512

1 x 512*

1 x 512

Intel Ice Lake

2019

1 core

AVX2

256 bit

3 x 256

2 x 256*

2 x 256

AVX512

512 bit

2 (1 x FMA)

2 x 512

1 x 512*

1 x 512

2020

2 (2 x FMA)

2 x 512*

Intel Knights Landing

2016

1 core

AVX512

512 bit

2 x 512

2 x 512*

1 x 512

Intel Knights Mill

2017

1 core

AVX512

512 bit

1 x 512

AMD Bulldozer

2011

1 module

AVX

128 bit

2 x 128

1 x 128

2 x 128*

1 x 128

AMD Piledriver

2012

AMD Steamroller

2014

AMD Excavator

2015

AVX2

AMD Zen

2017

1 core

AVX2

128 bit

4 x 128

3 x 128

1 x 128

2 x 128

4 x 128

2 x 128

AMD Zen 2

2019

1 core

AVX2

256 bit

4 x 256

3 x 256

1 x 256

2 x 256

4 x 256

2 x 256

AMD Zen 3

2020

1 core

There really isn't much that needs to be said. Intel chips currently have much better SIMD throughput.

*These processors appear to have monolithic FMA hardware that can do all floating-point arithmetic with the same latency and throughput. In the case of Intel processors, this also includes integer multiplication.

Why is y-cruncher faster on AMD processors than Intel processors?

This is obviously the reverse of the previous QA. Why is AMD now faster than Intel?

Intel used to destroy AMD in y-cruncher. Why has the situation reversed?

The short answer can be broken down into the following:

Starting with Alder Lake, Intel has dropped support for AVX512 on their client CPUs while AMD has added it in Zen4.
Intel has moved in the direction of hybrid computing with E-cores while AMD has stuck with full cores. y-cruncher runs terribly on E-cores.
AMD (as of 2023) overall has the more efficient processor partially thanks to its node advantage with TSMC 5nm vs. Intel 7.

But how did we get here?

In short, a combination of rather unfortunate events on Intel side combined with AMD's relentless improvements has flipped the tables.

While AMD struggled during the Bulldozer era (2011-2017), Intel enjoyed a period of no competition. So it became complacent. So during those years, there was very little improvements on Intel's side. We had nearly a full decade of quad-core desktop processors with minimal IPC gain simply because Intel had no competition to make them ship a better product.

AMD being the underdog underwent lots of internal changes and focused on building a new and better product (Zen). So while AMD was working to improve their product, Intel remained stagnant.

AMD's initial Zen processors weren't competitive with Intel yet. In short, AMD had a LOT of ground to make up. But when AMD started shipping Zen chips, they shipped 8-core chips while Intel was still shipping 4-core chips. AMD knew they still couldn't match Intel in per-core performance, but they could hit Intel where they had stagnated - increasing core counts.

As AMD continuing improving their product, Intel realized that they were losing their lead. So they started getting back to work. But this is where they ran into a new problem: Their 10nm process wasn't working. Not only was Intel's CPU architecture lead evaporating, they began to lose their process lead as their inability to ship 10nm opened the door for TSMC (which makes AMD's chips) to take the crown.

When AMD shipped Zen3 in 2020, it marked the point where Intel finally fell behind AMD. Not only had AMD surpassed Intel in multi-core performance (because it had more cores), it was also beating it in single-threaded performance. Intel has basically lost everything except for its brand. No, they didn't just let this happen. They flat out couldn't do anything about it because their 10nm process was broken and completely screwed up their development cycle.

Intel's Alder Lake CPUs (2021) would be able to regain the single-threaded performance lead over AMD, but at the cost of efficiency. Unlike AMD, Intel couldn't easily slap 16 of them on a desktop chip to compete with AMD and they knew this. So the only way to compete with AMD in both single-threaded and multi-threaded performance was to build two separate cores aimed at each one separately. P (performance) cores and E (efficiency) cores. Intel would place a small number of P-cores which "sacrifice everything" to beat AMD in single-threaded performance. Then they would add a large number of E-cores that also "sacrificed everything" to beat AMD in multi-threaded performance. In short, Intel couldn't build a core that could beat AMD in both categories. So instead, they made 2 separate cores specialized for each category.

With the history out of the way, we can get back to y-cruncher...

Intel, having lost its advantage over AMD found that the only way it can remain competitive with AMD is to "do weird things" which nobody wanted. I'm specifically referring to the splitting of P-cores and E-cores. This hybrid core design, while it makes Intel chips look good in certain benchmarks, is absolutely horrible to software developers. How the hell do you optimize for this?

To make matters worse, E-cores are "efficient" for a reason. They do the "bare-minimum" to run normal applications well. But they lack all the accelerators that make HPC applications like y-cruncher run well. AVX is halved in performance. No AVX512 at all. And because of the need for ISA parity across all cores, lacking AVX512 on the E-cores means it must also be disabled on the P-cores. So this loss of AVX512 combined with terrible performance of the E-cores destroys y-cruncher's performance on Intel's desktop processors.

The situation doesn't look like it will change in the near future. Intel is in big trouble right now which is why it is resorting to these tactics. In all likelihood we may be entering a role-reversal of the 2010s where Intel will only be able to catch up if AMD becomes complacent.

Academia:

Is there a version that can use the GPU?

Currently no. But if there were one:

It is conceivable that a GPU version can achieve better performance for computations that stay within GPU memory.
It is uncertain that a GPU version can beat the current CPU code for computations that fit into CPU memory, but are too large to fit into GPU memory.
A GPU version will not provide substantial benefits for record setting.

A GPU implementation must overcome several hurtles before it can challenge a modern CPU.

The GPU must beat the CPU in at least one of 3 workloads.
GPUs require massive vectorization. Large number arithmetic code is not that easy to vectorize.
Large computations of Pi and other constants are not limited by computing power. The bottleneck is in the data communication.

The GPU must beat the CPU in at least one of 3 workloads.

The GPU must beat the CPU in at least one of these categories by a wide margin. Furthermore, it cannot suck too much in the others.

IEEE-compliant double-precision.
64-bit integer multiplication with both lower and upper half 64-bit parts.
Carry-propagation of large integers.

These correspond to the 3 different classes of large integer multiplication algorithms that are suitable for high-precision arithmetic. In the past, any of the 3 was sufficient for record setting. But in the modern era, #3 alone is no longer adequate. And we are within an order of magnitude of hitting the limit of #1-based algorithms. Thus #2 will be required in the long term.

As of 2018, all 3 of these operations are efficient on CPUs. None of them are efficient on consumer grade GPUs. But #1 is efficient on high-end HPC-oriented GPUs.

To be clear, single-precision is not viable. Large number arithmetic isn't like deep-learning where you can get away with arbitrarily small precision. There is a genuine need for high-precision datatypes. If they don't exist or are inefficient, they must be emulated.

With #1 becoming inadequate soon (~10 years?), the window for GPUs to win here seems to be closing.

GPUs require massive vectorization. Large number arithmetic code is not that easy to vectorize.

GPUs require massive vectorization or otherwise very uniform (and independent) computation with little branching. Unfortunately, there are some important things in bignum arithmetic that don't fall into this category.

Carry-propagation implies a dependency chain which makes it difficult to both vectorize and parallelize.
Small-number arithmetic with irregular number sizes doesn't easily vectorize since the data is non-uniform. But it does parallelize. This is the case for Binary Splitting recursions.

The only thing that's really vectorizable in bignum arithmetic are the FFT-based algorithms for large multiplication. But even that is not without complications. Certain (important) memory-related optimizations will place limits on how large the SIMD vector can be. AVX512 is already pushing at these limits in the context of Skylake X processors. It's possible this problem will be worse on GPUs. More investigation is needed, though it shouldn't be a problem since there already exist GPU FFT implementations that outperform CPUs.

It's worth mentioning that Intel's Xeon Phi (co)processor line was interesting for some time as it was (in some sense) a GPU with CPU instructions. But initial benchmarks on Knights Landing were somewhat disappointing. Intel has since refocused the entire line away from general HPC and specialized it for deep-learning. In doing so, they have also nerfed the operations that are important y-cruncher. Thus the Xeon Phi line is no longer useful nor interesting for y-cruncher.

Large computations of Pi and other constants are not limited by computing power. The bottleneck is in the data communication.

This one is the most serious problem that is the final nail in the coffin for current generation GPUs.

GPUs, when properly utilized, can provide an amazing amount of performance - much more than CPUs. But the problem is that computation isn't even the bottleneck. The bottleneck is memory bandwidth and data communication in general. In other words, throwing GPUs at the problem isn't going to help.

How bad is the communication bottleneck?

y-cruncher is very close being memory bandwidth-bound on desktop CPUs with DDR3/4 memory.
y-cruncher is extremely memory bandwidth-bound on multi-socket servers with NUMA topology.
PCIe bandwidth is only a fraction of the CPU-memory bandwidth.
Disk I/O is only a fraction of PCIe bandwidth.

y-cruncher is already memory-bound on the CPU. Put it on a GPU over the PCIe, and it gets worse. So much for all that computing power! The solution here is to use the (large amount of) onboard GPU memory as a cache for the main memory in a manner similar to how y-cruncher currently uses main memory as a cache for disk. But this is an additional level of design complexity that will not be easy to do.

Let's assume we do that anyway. Yay! The GPU crushes the CPU for in-memory computations and we get some amazing benchmarks. But that doesn't help you in Swap Mode where you're running off the disk. And the disk sure as hell won't be faster than PCIe - even if you're saturating the system with NVMe drives. (Not that SSDs are a great idea anyway...)

To put this into perspective, I often get asked about how long it would take to set a record for Pi (or some other constant) using X hardware. I typically tell them something on the lines of:

"Given infinite computing power, it will take Y months with your storage configuration".

In other words, computation doesn't even matter anymore. It's all about the disk I/O and the storage. The most economical way to set a record nowadays is to get a machine with the fastest storage configuration, largest memory configuration, and... a mediocre CPU.

The "fastest storage configuration" because that's the bottleneck.
The "largest memory configuration" because it minimizes the amount of disk I/O that is needed.
A "mediocre CPU" because that's all you need before you hit the disk bottleneck.

Why can't you use distributed computing to set records?

No for more or less the same reasons that GPUs aren't useful.

Just as with GPUs, computational power is not the bottleneck. It is the data communication. For this to be feasible as of 2024, everyone would need to have an internet connection speed on the order of 10 - 100 GB/s. That's not a typo. I do mean 10-100 gigabytes (not gigabits). Anything slower than that and it's faster to do it on a single computer.
Computing a lot of digits requires maintaining a very large dataset with no tolerance for data loss or corruption. This presents some difficult problems:
1. Such a large datasize is too large to fit onto any single machine. So it would need to be distributed across many participants.
2. There needs to be enough redundancy to handle participants leaving the network and taking with it all the data that it holds.
3. Achieving Byzantine Fault Tolerance would be quite difficult with such large dataset.

Follow up question: "I don't understand. If you're computing a trillion digits, why can't you just assign different parts of it to different computers. Then they can all run in parallel and be combined at the end."

Layman Answer: That's not how it works. Computing the digits of Pi is like building a skyscraper. You cannot just assign different floors to different contractors to build at the same time and combine them at the end. You need to finish each floor before you can build the floor above it. The only way to parallelize is to have the different contractors work together within each floor. In other words, the parallelism is horizontal, not vertical.

Technical Answer: Computing all the digits of Pi from 1 to N in an efficient manner is a coarse-grained parallelizable task. At the very top level, it is not parallelizable at all. All the parallelism is at the lower levels. Therefore, communication between all workers is very frequent - enough to become a bottleneck.

There exist algorithms like BBP to directly compute arbitrary binary digits without the memory cost of computing all the digits before it. These are called "digit-extraction" algorithms. However, these algorithms require roughly O(N*log(N)) time to compute a small number of digits at offset N. Using this approach to compute all the digits from 1 to N will result in a quadratic run-time algorithm. This alone makes it unsuitable for large N.*

To make things worse, the currently known digit-extraction algorithms for bases other than binary are much slower. And a radix conversion runs into the same all-to-all communication problem as the current methods to compute Pi.

*While there exists some potential ways that can make the algorithm sub-quadratic, they haven't been researched because they don't solve the problem of the radix conversion.

Is there a distributed version that performs better on NUMA and HPC clusters?

No, but it is a current research topic.

Starting from v0.7.3, y-cruncher has had the ability to detect the NUMA configuration and automatically interleave the memory across all the nodes. This goes a long way to countering the interconnect traffic imbalance that results from the operating system first-touch NUMA policies. But nevertheless, this isn't a scalable solution.

Maintaining scalability across larger and larger systems will require the program and algorithms to be developed specifically for NUMA. The MRFM project will attempt to do this. But this hasn't made much progress over several years and there is no end in sight.

Why have recent Pi records used desktops instead of supercomputers?

While the rest of the world is trending towards more parallelism, computations of Pi seems to have gone the other way.

The only recent Pi record which has "gone the other way" is Fabrice Bellard's computation of 2.7 trillion digits back in 2009. That was the major leap from supercomputer to... a single desktop computer. But since then, all the records have been done using increasingly larger (commodity) server hardware. Nevertheless, the hardware used in these computations are still pretty far removed from actual supercomputers.

So the real question is: Why aren't we using supercomputers anymore?

Unfortunately, I don't have a good answer for it. Sure y-cruncher has been dominating the recent Pi records using single-node computer systems. But that doesn't explain why nobody from the supercomputer field has joined in. Some possible contributing factors are:

The performance gap between CPU and memory has grown so large that perhaps supercomputers simply cannot be efficiently utilized. The recent Pi computations using single-node desktops and servers had disk bandwidth on the order of gigabytes per second. Supercomputer interconnects are generally slower than that.

Supercomputers have more useful things to do. So it's probably more economically viable to use commodity hardware and run for several months than to tie down a multi-million dollar supercomputer for even a few days. Prior to 2010, there were no known desktop programs capable of efficiently computing Pi to that many digits. So it wasn't even possible to run on commodity hardware unless you wrote one yourself.

Supercomputers are generally inaccessible to the public. On the other hand, everybody has a laptop or a desktop. So the pool of programmers is much larger for commodity hardware than supercomputers.

It's easier to program for desktops than supercomputers. This makes it possible to implement more complex programs which could otherwise be prohibitively difficult for supercomputers. This is probably why the majority of the supercomputer Pi records were done with the AGM method instead of the series methods. The AGM is a lot slower, but it's also much simpler.

Why do desktops use the Chudnovsky formula while supercomputers use the AGM? Isn't the AGM faster?

This is a loaded question that has different long answers depending on the audience.

Mathematical Perspective:

The common misconception is that the AGM is faster because it converges much faster. Only a logarithmic number of iterations is needed for the AGM. By comparison, the series algorithms such as Chudnovsky only give a fixed number of digits per term. Thus they require a linear number of iterations.

However, not all "steps" are equivalent. An AGM iteration is not the same as a series "term". The amortized cost of summing up a series term is much cheaper.

To oversimplify things a bit, the series algorithms are fast because the Binary Splitting algorithm allows you to simultaneously sum up many terms at once. While intuitively hard to believe, this speedup is enough to overcome the slower convergence of the series algorithms relative to the AGM. In the end, it becomes faster to sum up trillions of series terms than it is to iterate 40+ AGM iterations.

Computer Science Perspective:

Using modern algorithms, the AGM algorithms run in O(N log(N)²) time. The series algorithms run in O(N log(N)³) with the Binary Splitting method. So the same question still applies. Why does everybody use the Chudnovsky formula when it is slower than the AGM?

While the AGM has a lower run-time complexity, it has a much larger Big-O constant. And because the difference between the two is only a logarithmic factor, it will require a very large N to overcome the difference in Big-O. At the current sizes (trillions of digits), that N is still not large enough.

To make matters worse, the AGM algorithms have very poor memory locality. So when faced against the modern memory wall, the gap between series and AGM gets even wider. Current trends in computing have this gap growing wider at a rate that outpaces the logarithmic factor difference in run-time complexity.

Engineering Perspective:

So that leaves the final question: Why do current computations use the Chudnovsky formula while the supercomputer computations of the 90s and 2000s use the AGM?

While the series algorithms are faster, they are also much more difficult to implement. So once you factor in software development costs, it becomes more practical to use a suboptimal algorithm and allow the sheer power of the supercomputer to overcome any computational inefficiency that comes with the slower algorithm.

Generally speaking, desktop applications are much easier to develop than supercomputer programs. At the very least, "everybody" has a PC sitting on a desk at home that they can play with all to themselves. But not everybody has access to a supercomputer. While this is technically no longer true thanks to cloud computing services, the costs of "renting a supercomputer" off a cloud service is prohibitive to most people.

Programming:

What is the technical explanation for the notorious, "Coefficient is too large" error?

The "Coefficient is too large" error is one of many redundancy checks which y-cruncher uses. To understand what this redundancy check is at the technical level, it helps to have a basic understanding of large multiplication via Fast Fourier Transform (FFT).

In this algorithm, each of the input operands are broken up into coefficients of a polynomial. FFT is then used to multiply the polynomials together. Based on the size of the input polynomial and its coefficients, it is possible to prove certain limits for the coefficients of the output polynomial.

y-cruncher has a redundancy check that checks all the coefficients to make sure they are all less than the limit. If any are above this limit, then it knows that a computational error has occurred and it throws the "Coefficient is too large" error.

The reason why the "Coefficient is too large" error is so common is a combination of various factors:

The coefficient check has near complete coverage of all the computational work in a large multiplication. An error that occurs anywhere in the algorithm except for the final carryout will have an extremely high probability of completely corrupting some of the coefficients.

Almost all the computational run-time is spent performing large multiplication with FFTs. Therefore the probability that an error happens to be in a large multiplication is very high.

y-cruncher uses this coefficient check because it has high coverage with minimal computational overhead. But this high coverage also makes the error almost useless in helping to track down the source of the error since it's basically the same as, "something went wrong".

Is there a publicly available library for the multi-threaded arithmetic that y-cruncher uses?

Technically yes, but it's an abandoned project that's no longer maintained.

Internally, y-cruncher changes far too much and far too quickly for any arithmetic-level interface to be sustainable in the long run. Past experience has shown that regardless of what interface is designed, it quickly becomes out-of-date within a year. And within 2 years, the internals will have changed in some fundumental way that makes it no longer possible to implement the interface. In other words, backwards compatibility is thoroughly destroyed every few years.

y-cruncher has always been evolving like this with constant and intrusive changes not just to the internal implementations, but also the APIs and abstraction layers that hold the program together. The only thing that doesn't change is the UI - which has looked about the same since the colors were added back in 2009. In the end, I've come to realize that anything that resists change (such as a public interface) will inevitably hinder the development of the project.

Is y-cruncher open-sourced?

For all practical purposes, y-cruncher is closed source. Only about 5% of the code is open-sourced and available to see on GitHub under varying licenses.

The 5% includes the Launcher, the Digit Viewer, and their dependencies. Also, a very out-of-date version of the object library headers can be found in the abandoned NumberFactory side-project.

None of the mathematical or computational code is open-sourced. So if that's what you're looking for, you will not find it.

If you are interested in getting into Pi and large number arithmetic at a programming level, you can start by looking at this GitHub repo. There are also many other examples on the internet if you search around.

Other:

What about support for other platforms? Mac, ARM, etc...

Short answer: Not right now. There's little to gain for a lot of effort.

While it would be nice to have y-cruncher running everywhere, the time and resource commitment is simply too high. So the best I can do is cover the most important platforms and call it a day.

y-cruncher currently supports 3 platforms:

x86-Windows
x64-Windows
x64-Linux

Either of the x64 platforms is sufficient for y-cruncher's purpose. x86-Windows is there because the program started that way and it's easy to maintain it alongside x64-Windows. On the other hand, x64-Linux is a completely different platform with different compilers and system APIs. For this reason, it takes a significant amount of time and effort to keep y-cruncher working on x64-Linux.

My experience with just Linux has basically convinced me to stay away from any additional platforms for the time being.

What about Mac/x64?

There are several things here:

Support for Mac is not needed for any of y-cruncher's goals. Nobody uses Macs for breaking records, and the ecosystem is too locked down for there to be any sizable population of hardware enthusiasts and overclockers.

Certified Mac hardware is more expensive than equivalent hardware for Windows or Linux. So it's a very high cost of investment unless we enter the realm of Hackintoshes.

From a practical standpoint, I haven't used Macs since my childhood days. I have no experience with modern Mac ecosystem and there would be a lot of learning to do to catch up.

What about ARM?

While ARM is rapidly improving, it's unlikely to be supported until it achieves significant market share in the high-end and high-performance server market.

Translation: Support for ARM will not be considered until:

ARM becomes competitive with x64 in absolute performance.
ARM becomes mainstream in the PC market. (Laptops, desktops, workstations, etc...)

In other words, ARM will not be supported until it "beats" x64. And I'm not gonna try to guess at if and when this will happen. This applies not just to ARM, but also to any other suitable computer architecture.