y-cruncher - Developments

(Last updated: August 3, 2019)

 

 


 

Active Branches:

 

This is a table of all the active branches of y-cruncher. As this is just a side-hobby, overall development is slow and there is no formal release cycle. But in recent years, feature releases have been approximately every 4 - 8 months with multiple patches in between.

 

Version 0.7.8 is now in partial feature freeze. ETA for release is the September/October timeframe.

Branch v0.7.7 - Version v0.7.7.9501

 

Branch v0.7.8 - Version v0.7.8-9492-199 - Changes from v0.7.7.9501:

Swap Mode Changes:

  • The "Swap Configuration" has been replaced with "Far Memory Configuration". In the past, you had to use the RAID-0/3 implementation. Now, you can now choose from multiple "far-memory" frameworks. In other words, you can now select far-memory frameworks like you can with the parallel frameworks and memory allocators. Currently, there are only 2 choices. But more are expected in the future.

    • Legacy Disk RAID 0/3: This is the really old multi-level raid-file implementation that has been in use since v0.6.1.
    • Disk RAID 0: This is new to v0.7.8. It is a dedicated RAID 0 implementation that is more efficient than the Legacy RAID 0/3.

  • There is now a "Far-Memory Tuning" menu that holds performance characteristics of the far-memory configuration. The "bytes/seek" parameter has been moved into this menu as a sub-parameter.

  • The legacy RAID 0/3 now has its own allocator settings for its I/O buffers. In the past, it used the same allocator as the one used for computational memory. This was decoupled as a part of the aforementioned far-memory refactor.

  • The new Disk RAID 0 framework allows per-path allocators. When combined with the NUMA-aware allocators, it becomes possible to place the I/O buffers on the NUMA nodes that are closest to their respective I/O devices.
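
For illustration, a far-memory framework can be thought of as a common interface that the computation programs against, with the legacy RAID 0/3 and the new Disk RAID 0 as interchangeable backends. The sketch below is only conceptual (y-cruncher's actual internal API is not published) and uses a made-up RAM-backed backend just so it runs:

    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <map>
    #include <vector>

    // Hypothetical sketch of a "far memory" interface. This only illustrates the
    // idea of putting the legacy RAID 0/3 and the new RAID 0 behind interchangeable
    // backends - it is not y-cruncher's actual API.
    class FarMemory {
    public:
        virtual ~FarMemory() = default;
        virtual uint64_t allocate(uint64_t bytes) = 0;                                          // reserve a region
        virtual void read (uint64_t id, uint64_t offset, void* dst, uint64_t bytes) = 0;        // far -> local
        virtual void write(uint64_t id, uint64_t offset, const void* src, uint64_t bytes) = 0;  // local -> far
    };

    // Trivial RAM-backed stand-in so the sketch runs. A real backend would stripe
    // the region across files/devices and manage its own I/O buffers.
    class RamBackend : public FarMemory {
        std::map<uint64_t, std::vector<uint8_t>> regions_;
        uint64_t next_id_ = 0;
    public:
        uint64_t allocate(uint64_t bytes) override {
            regions_[next_id_] = std::vector<uint8_t>(bytes);
            return next_id_++;
        }
        void read(uint64_t id, uint64_t offset, void* dst, uint64_t bytes) override {
            std::memcpy(dst, regions_[id].data() + offset, bytes);
        }
        void write(uint64_t id, uint64_t offset, const void* src, uint64_t bytes) override {
            std::memcpy(regions_[id].data() + offset, src, bytes);
        }
    };

    int main() {
        RamBackend far;                       // could just as well be a disk RAID backend
        uint64_t id = far.allocate(64);
        const char msg[] = "swapped out";
        far.write(id, 0, msg, sizeof(msg));
        char buf[sizeof(msg)];
        far.read(id, 0, buf, sizeof(buf));
        std::cout << buf << "\n";             // prints "swapped out"
    }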

 

Changes to Aid Large Computations:

  • The Disk RAID 0 framework supports error-detection checksums that should be able to detect most forms of silent I/O data corruption. This is enabled by default even though it has a small to moderate performance penalty.

  • You can now tell y-cruncher to run a system command after each checkpoint. This command is a blocking operation: y-cruncher will halt execution until it finishes. This makes it easier and safer to perform automated backups of the checkpoints. (A small sketch follows this list.)

  • The Custom Compute menu now gives an upper-bound for how much space is needed to store the largest checkpoint. This lets you properly budget storage for backups.
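
Conceptually, the post-checkpoint command boils down to a blocking call out to the shell once the checkpoint files have been flushed. A minimal sketch (not y-cruncher's actual code - the command string is whatever the user configures):

    #include <cstdlib>
    #include <iostream>

    // Illustrative sketch of a blocking post-checkpoint hook.
    // The command is user-supplied; the string below is only a placeholder.
    void run_post_checkpoint_command(const char* command) {
        std::cout << "Checkpoint written. Running: " << command << std::endl;
        int ret = std::system(command);   // blocks until the command finishes
        if (ret != 0)
            std::cout << "Warning: post-checkpoint command returned " << ret << std::endl;
        // Only after the command returns does the computation resume, so the
        // backup always sees a complete and consistent checkpoint.
    }

    int main() {
        run_post_checkpoint_command("echo copying checkpoint files to backup storage");
    }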

 

BBP Digit Extractor Revamp:

 

The BBP Digit Extractor has been revamped for the first time since v0.6.8. Previously, the digit extractor had an offset limit of 2^46. This meant that it couldn't verify Pi computations larger than 84.7 trillion decimal digits. The new version increases this limit significantly.
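
(BBP offsets are hexadecimal digit positions, so the old limit works out to about 2^46 * log10(16) ≈ 70.4 trillion * 1.204 ≈ 84.7 trillion decimal digits.)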

                             Offset Limit
y-cruncher Version           Original Formula     Bellard's Formula
< v0.7.8                     2^46                 2^46
v0.7.8 (no AVX2)             ~2^48                ~2^49
v0.7.8 (AVX2 or later)       ~2^58                ~2^59

The new limit of 2^59 will likely be large enough for many years to come. But reaching these higher offsets will come at a performance cost as it is significantly slower to sum up terms with larger divisors. As a result, performance will be less linear and less predictable than before.

 

As part of this revamp, the BBP Digit Extractor is now faster than before for the sizes that were reachable in past versions.

 

 

Misc. New Features:

  • New Formula for Catalan's Constant: Jesús Guillera discovered another fast formula for Catalan's Constant. It is the 2nd fastest known formula for the constant and has been implemented natively in y-cruncher.

  • Reduced Memory Mode for Pi: Two additional "algorithms" for Pi have been added that run the Chudnovsky and Ramanujan formulas using less memory/storage at the cost of speed. The current memory reduction is about 5 - 15%. Since these are optimized for memory rather than speed, future versions of y-cruncher may improve the memory reduction at even more cost to speed. This feature has existed since 2013 and is finally being enabled in a public release. Though some older releases of y-cruncher already had it enabled as a hidden option.

  • Added a new parallel framework based on a simple task queue. This framework wasn't intended to be performant for computation, but it has been added to the framework list anyway. (A generic sketch of the idea follows this list.)
  • Some of the parallel frameworks now have an option to better control the recursive fan-out behavior when dispatching a large number of tasks.
  • When resuming a computation, the resume date will be printed on the screen.
  • The exponent limit for the shift and power functions has been increased to the full 64-bit range.
  • Support for radicals has been added to the custom formulas.
  • Threading Building Blocks (TBB) is now supported in the dynamically linked Linux binaries.
  • All the non-AVX Windows binaries will now be able to utilize more than one processor group.
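
For reference, the general shape of such a task-queue framework is sketched below. This is a generic illustration, not y-cruncher's actual implementation: worker threads repeatedly pull tasks off a shared queue until it is drained.

    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Generic sketch of a simple task-queue parallel framework (not y-cruncher's
    // actual code). Worker threads pull std::function tasks off a shared queue.
    class TaskQueue {
        std::queue<std::function<void()>> tasks_;
        std::mutex mutex_;
        std::condition_variable cv_;
        std::vector<std::thread> workers_;
        bool done_ = false;

    public:
        explicit TaskQueue(unsigned threads) {
            for (unsigned i = 0; i < threads; i++) {
                workers_.emplace_back([this] {
                    for (;;) {
                        std::function<void()> task;
                        {
                            std::unique_lock<std::mutex> lock(mutex_);
                            cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                            if (tasks_.empty()) return;   // shutting down and nothing left
                            task = std::move(tasks_.front());
                            tasks_.pop();
                        }
                        task();   // run the task outside the lock
                    }
                });
            }
        }
        void submit(std::function<void()> task) {
            { std::lock_guard<std::mutex> lock(mutex_); tasks_.push(std::move(task)); }
            cv_.notify_one();
        }
        ~TaskQueue() {   // drain the queue, then join the workers
            { std::lock_guard<std::mutex> lock(mutex_); done_ = true; }
            cv_.notify_all();
            for (auto& t : workers_) t.join();
        }
    };

    int main() {
        TaskQueue pool(4);
        for (int i = 0; i < 8; i++) {
            pool.submit([i] { std::cout << "task " << i << "\n"; });
        }
    }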

Optimizations:

  • The 18-CNL binary is now properly tuned for the Core i3 8121U. Though it will probably get re-tuned for Ice Lake in the future.
  • Most of the newer binaries have been retuned.
  • Some of the parallel frameworks have been re-tuned.
  • The power function in custom formulas is now faster for small inputs.

Removed Features:

  • Support for Windows Vista has been dropped. All Windows binaries now require Windows 7 or later.
  • The 16-KNL binary has been removed. y-cruncher never ran well on Xeon Phi and Intel has officially discontinued the entire processor line.
  • The 11-BD1 binary no longer uses XOP instructions. This allows it to be run on Zen 1 and Zen+ processors. The performance impact on Bulldozer processors should be minimal. Since XOP is no longer supported on any modern processors, it has become increasingly difficult and inconvenient to test this binary.

Other:

  • Lots of minor UI and configuration changes due to internal refactorings.
  • Performance fluctuations for the Windows AVX and AVX512 binaries due to compiler changes.

Branch trunk - Version v0.7.9.9502-1

Changes are relative to v0.7.8.

 

Deprecations:

  • The 00-x86, 04-P4P, 05-A64, and 08-NHM binaries will no longer run efficiently on processors older than Nehalem.

The previous version (v0.7.6) is no longer maintained. But build support will remain until at least January 2020 just in case any unfinished long-running computations encounter a bug that requires a new build to continue.

 

 

Developments:

 

As of 2019, y-cruncher is over half a million lines of code. For a single-person hobby project, this is large enough to easily collapse under its own weight. So to keep things manageable, development of the project has slowed down in recent years to refocus on long-term maintainability rather than rushing out tons of features and enhancements in a short amount of time.

The result of these "policies" is that it can take a very long time for anything to get done - easily months or years. But for a personal project with no deadlines or oversight, this is perfectly acceptable. Since there is less than one person working on y-cruncher, development is optimized for long-term throughput rather than short-term latency of new features.

 

Because of the long incubation periods of various features/improvements, there are typically many things going on at once. So I can work on whatever I feel like at any given point. If I get bored of something, I'll shelve it and come back (potentially years) later. What's shown on this page is just a subset of all things going on.

 

Ongoing:

Feature Description Status
Nth Root Radicals

Custom Formulas:

 

Add support for nth root radicals.

This will enable the computation of Gamma(1/3).

This was originally intended for v0.7.7 with the first release of the Custom Formula feature. But it slipped due to some unexpected numerical stability issues combined with the large backlog of other tasks for v0.7.7.

 

This is now confirmed for v0.7.8.

The new function is called "InvNthRoot" and will compute x^(-1/r).

 

Ideas on the Radar:

Feature Description Status
exp(x)

Custom Formulas:

 

Add support for the exponential function.

This will enable the fully generic non-integer power function for real numbers.

 

This involves inverting log(x) with Newton's Method.
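
Concretely, to compute y = exp(x), Newton's Method is applied to f(y) = log(y) - x, which gives the iteration

    y_{k+1} = y_k * (1 + x - log(y_k))

Each iteration roughly doubles the number of correct digits, but it also requires a log() evaluation at the working precision of that iteration - which is where the concerns below come from.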

This one is going to be ugly. While it's conceptually easy to invert the natural logarithm with Newton's Method, there's a whole mess of other things to deal with:

  • log(x) is already painfully slow. A simple Newton's Method inversion will result in an even more painfully slow exp(x). There are a lot of optimizations that the current AGM and log(x) implementations are not doing that may affect the exp(x) implementation. If the end goal is to have a fully optimized exp(x), is now even an appropriate time to do it? There is a real possibility that any work done now will need to be redone later.

  • The current log(x) has this mess with the constants Pi and log(2). This is probably going to carry over to exp(x).

  • Status printing is going to be a problem. Right now log(x) has a lot of status printing - especially when Pi and log(2) are not precomputed. And log(x) will be called many times by exp(x). There needs to be a way to slim it down without sacrificing too much information delivery.

This was on the roadmap for v0.7.8, but it has been shelved indefinitely.

Radix Conversion Checkpointing

Add checkpointing to the radix conversions.

Checkpointing was never implemented in the past for two reasons:

  1. Technical debt: The current radix conversion implementation was never designed to be checkpointed. It is just a prototype that ended up in production and it's messy as hell.

  2. The algorithm is an in-place algorithm. Each step either destroys or modifies the state of the previous step. This is problematic because it's difficult to guarantee a consistent state when the computation can be interrupted at any point.

It's time to revisit this since the conversion can take weeks for the Pi record frontier.

Improved Stress Tester

The current stress-tester has increasingly poor coverage on today's processors. In fact, the program's unit and integration test frameworks are a better stress test than the dedicated stress test itself!

 

Find a way to build a new stress test around these "better" workloads.

The difficulty here is that the "best" stress test requires mixing all the different instruction sets (SSE, AVX, AVX512). But each of y-cruncher's binaries focuses on only one of them and it's not easy to use a "lower" ISA when compiling for a "higher" one since the compiler will try to vectorize the code to use the higher one.

Optimized Square Root

Custom Formulas:

 

The current square root is just the inverse square root followed by a multiply by the input.

 

It may be possible to do better by merging that multiply in the final iteration - as is the case for division and reciprocal.
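
One possible form of that merge (a standard trick, not necessarily what y-cruncher will end up doing): with r ≈ 1/sqrt(x) from the inverse square root iteration and s = x*r as the initial estimate of sqrt(x), a single fused correction

    s' = s + r*(x - s^2)/2

roughly squares the error of s - analogous to the fused final step q' = q + r*(a - b*q) used for division, where r ≈ 1/b and q = a*r.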

 
Optimize the AGM

Custom Formulas:

 

In the final iterations of the AGM, much of the output is already known. These can be used to skip some of the early iterations of the Newton's Method square root.
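
For reference, the AGM iterates

    a_{k+1} = (a_k + b_k)/2,    b_{k+1} = sqrt(a_k * b_k)

and a_k, b_k approach each other quadratically. In the late iterations a_k ≈ b_k to many digits, so a large prefix of sqrt(a_k * b_k) is effectively known in advance and could be used to seed the Newton's Method inverse square root instead of starting it from a low-precision estimate.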
 
Optimize log(x)

Custom Formulas:

 

The current log(x) implementation is a dumb wrapper around AGM(1,x).

 

There are ways to make the AGM require fewer iterations that should be investigated.
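
For context, the standard AGM-based logarithm looks roughly like this: for p bits of precision, pick m so that s = x * 2^m exceeds 2^(p/2), then

    log(x) ≈ Pi / (2 * AGM(1, 4/s)) - m * log(2)

which is also where the dependencies on Pi and log(2) mentioned in the exp(x) entry come from. Anything that cuts down the number of AGM iterations feeds directly into a faster log(x).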

 
Non-Monotonically Convergent Hypergeometric Series

Custom Formulas:

 

Extend the SeriesHypergeometric function to allow series that are not monotonically convergent.

 

This is needed for confluent hypergeometric functions at large inputs where the series initially diverges before eventually converging.
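
The prototypical example is Kummer's confluent hypergeometric function

    1F1(a; b; x) = sum over n >= 0 of [ (a)_n / (b)_n ] * x^n / n!

where (a)_n = a(a+1)...(a+n-1). For large |x| the term ratio is roughly x/n, so the terms grow in magnitude until n is on the order of |x| before the factorial in the denominator takes over - exactly the "diverges before it converges" behavior described above.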

 

This will allow certain approximation algorithms to be implemented:

  • Sweeney's method for the Euler-Mascheroni Constant
  • Gamma function at rational inputs.

This is expected to be very difficult.

 

Irregular convergence behavior wreaks havoc on the Binary Splitting implementation since a lot of assumptions break down. While the Brent-McMillan formula for the Euler-Mascheroni Constant already exhibits this behavior, it is specially handled.

 

The most difficult part of this is precision control. In order to do precision control, y-cruncher needs to know:

  1. How large the series terms can get.
  2. What magnitude the final value converges to.

#1 is complicated, but likely doable.

#2 doesn't seem approachable in the generic case with the mathematical techniques that I know of.

 

#2 is difficult due to destructive cancellation. Confluent hypergeometric functions are notorious for this behavior. And it is just the simplest case.

 

Take a series: something as simple as taking its derivative can drastically alter the magnitude of the value that it converges to. (e.g. exp(-x^2) and the Error Function for very large x)
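
To make the cancellation concrete: the Taylor series

    exp(-x^2) = sum over n >= 0 of (-1)^n * x^(2n) / n!

has terms whose magnitude peaks around n ≈ x^2 at roughly e^(x^2), yet the sum is the exponentially tiny e^(-x^2). The closely related Error Function series has terms of comparable size but converges to a value near 1 for large x. So the size of the largest term says almost nothing about the magnitude of the final value.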

 

It is likely that #2 will require the user to explicitly tell y-cruncher what the final magnitude will be.

 

Stalled:

These are projects which have either stalled with no recent progress or are so long-term that there is no roadmap.

Feature Description Status
Slave Mode

Provide a way to control y-cruncher through TCP. This will allow 3rd party applications to build a GUI around y-cruncher.

 

More details here: https://github.com/Mysticial/y-cruncher-GUI

An early version of this was launched with v0.7.6. As of v0.7.7, it theoretically should be complete enough to implement a full GUI for both the stress tester and custom compute options.

 

Incremental progress will continue to be made. In particular a unified way to express menus and suboptions is still needed.

 

As of 2019, priorities have shifted elsewhere.

MRFM

MRFM stands for "Multi-Region Far Memory". It is a very large experimental project that will attempt to solve the NUMA problem and more generally, the supercomputer problem.

 

The currently planned design of MRFM is a fundamental departure from y-cruncher's battle-tested model of computation. So virtually all high-level code will need new implementations for MRFM.

 

As a result, the plan calls for two completely new computation modes:

  • Multi-Region NUMA
  • Multi-Region Swap

This will bring the total number of modes to 4. But if all goes as planned, Multi-Region Swap will become a no-compromise generalization of the current Swap Mode. So it will be possible to remove the current Swap Mode without losing much functionality.

 

Due to the sheer scale of this project along with a large number of unknowns, this was expected to be (and has become) a multi-year project with no guarantee of success.

Much of the planning for this began years ago. But actual coding didn't start until around September of 2016.

 

As of 2018, progress is stalled due to large amounts of technical debt in y-cruncher's high-level code which have been accumulating for years.

 

MRFM remains a multi-year project with no end in sight.

 

 

Areas of Research:

 

Silent Data Corruption:

Disk I/O is the bane of large number computations. Historically, it was bad only because it was slow. Nowadays, it's also unreliable.

 

"Unreliability" comes in several forms:

  1. Basic I/O errors where an API call to read or write fails.
  2. The entire drive fails completely leading to permanent data loss.
  3. Silent data corruption that is neither detected by the hardware nor the operating system.

(1) and (2) are easily solved using checkpoint-restart and periodic backups. y-cruncher has supported checkpointing since v0.5.4, so this isn't really a problem anymore. But (3) is very scary and remains a huge problem today.

 

Silent data corruption is the worst since it's undetected. It plagues database applications, and unfortunately, y-cruncher is extremely vulnerable as well. If an error manages to slip through y-cruncher's built-in redundancy checks, it will cause a computation to finish with the wrong digits.

 

 

Analysis:

 

Hard drives already use error-correction and CRCs to ensure data integrity. And transfers between the drive and the controller are also protected with CRCs. So you would expect that data corruption would be detected by the operating system, right? Nope...

 

When a hard drive fails to read a sector due to CRC failures, it usually propagates all the way back into the program and manifests as a (1). So that's not a problem. But transfer errors between the drive and the controller are less ideal. Supposedly, transfer errors lead to retransmission. But this isn't always the case.

 

Throughout the years of developing y-cruncher, transfer errors have also been observed to cause:

  1. Severe slowdowns or hangs of the drive in question.
  2. Silent data corruption.
  3. Silent data corruption along with an increase to the S.M.A.R.T. "Ultra ATA CRC error count".

(1) is to be expected if the connection quality is so bad that the data gets stuck in a retransmission loop.

(2) is also to be expected if the corrupted data passes a CRC by chance. (I have no idea how long the CRC is, but if it's CRC32, a 1 in 4 billion chance isn't small.)

(3) makes absolutely no sense at all. If the hard drive was able to detect the error, then why the hell doesn't it notify the OS that the operation failed?

 

In any case, there are other things that don't add up. And in the end, we are forced to accept the reality.

 

To date, y-cruncher's only defense against silent data corruption is RAID 3. When the swap file configuration is set to use RAID 3, (almost) all reads are parity checked. And if the data is bad, it will report a failure. The parity bits are stored inverted so that a zeroing error that wipes out everything will still fail the parity check.
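
A minimal sketch of the inverted-parity idea (illustrative only, not y-cruncher's actual RAID 3 code): the parity of a stripe is stored bit-flipped, so a failure mode that zeroes out every block, parity included, no longer passes the check.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Inverted parity: parity = NOT (XOR of all data words). If the entire stripe
    // (data + parity) is silently zeroed, the check fails - whereas with plain XOR
    // parity, an all-zero stripe would verify as "clean".
    uint64_t make_parity(const std::vector<uint64_t>& data) {
        uint64_t x = 0;
        for (uint64_t w : data) x ^= w;
        return ~x;                          // store the complement
    }

    bool verify(const std::vector<uint64_t>& data, uint64_t parity) {
        uint64_t x = 0;
        for (uint64_t w : data) x ^= w;
        return parity == ~x;
    }

    int main() {
        std::vector<uint64_t> stripe = {0x1234, 0xABCD, 0x5555};
        uint64_t parity = make_parity(stripe);
        std::cout << "intact: " << verify(stripe, parity) << "\n";      // 1

        // Simulate a zeroing error that wipes the whole stripe and its parity.
        std::vector<uint64_t> zeroed(stripe.size(), 0);
        std::cout << "zeroed: " << verify(zeroed, 0) << "\n";           // 0 -> detected
    }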

 

Unfortunately, this is far from sufficient.

To date, silent data corruption has yet to bring down a world-record attempt. But it happens regularly in the development test machines (which contain some very old hard drives). There has also been an instance reported on a forum where a 1 trillion digit computation of Pi failed - presumably due to silent data corruption.

 

Approximately 80% of the disk I/Os that y-cruncher does are covered by redundancy checks at a higher level. So in the absence of RAID 3, silent data corruption will be detected with 80% probability if we assume uniform distribution. To put it simply, 80% is not good enough. But at the very least, failing a redundancy check should be an immediate and serious red flag.

 

 

Possible Solutions:

 

For a long time, solving the silent corruption problem was deemed too difficult. All software solutions come with a non-trivial amount of overhead.

 

y-cruncher v0.7.8 saw a complete overhaul of the Swap Mode functionality which finally allowed new features to be added. With the door opened to new stuff, one of the first experiments was a checksumming protocol which turned out to be more performant than expected. Thus y-cruncher v0.7.8 will feature checksumming for disk I/O that will be enabled by default.
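
The general shape of such a protocol is sketched below, under the assumption that a checksum is appended to each block on write and verified on read. The actual on-disk format and checksum are not specified here; the FNV-1a hash is only a stand-in for whatever a real implementation would use.

    #include <cstdint>
    #include <cstdio>
    #include <stdexcept>
    #include <vector>

    // Toy 64-bit FNV-1a hash - a stand-in for a real CRC/checksum.
    uint64_t checksum(const uint8_t* p, size_t n) {
        uint64_t h = 0xcbf29ce484222325ull;
        for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 0x100000001b3ull; }
        return h;
    }

    // Write a block followed by its checksum.
    void write_block(std::FILE* f, const std::vector<uint8_t>& block) {
        uint64_t sum = checksum(block.data(), block.size());
        std::fwrite(block.data(), 1, block.size(), f);
        std::fwrite(&sum, sizeof(sum), 1, f);
    }

    // Read a block back and verify it. A mismatch means the data was silently
    // corrupted somewhere between the write and this read.
    std::vector<uint8_t> read_block(std::FILE* f, size_t block_size) {
        std::vector<uint8_t> block(block_size);
        uint64_t stored = 0;
        if (std::fread(block.data(), 1, block_size, f) != block_size ||
            std::fread(&stored, sizeof(stored), 1, f) != 1) {
            throw std::runtime_error("I/O error reported by the OS");
        }
        if (checksum(block.data(), block.size()) != stored) {
            throw std::runtime_error("silent data corruption detected by checksum");
        }
        return block;
    }

    int main() {
        std::FILE* f = std::fopen("swap.bin", "wb+");
        write_block(f, std::vector<uint8_t>(4096, 0xAB));
        std::rewind(f);
        read_block(f, 4096);                // throws if the block no longer matches
        std::puts("block verified");
        std::fclose(f);
    }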

 

Will this solve the silent corruption problem? Only time will tell.

 

 


Non-Uniform Memory Access (NUMA):

y-cruncher is currently a shared memory program. So it will run efficiently given the following assumptions:

  1. All the memory in the system is shared and can be accessed by all processor cores.
  2. All processor cores have low latency and high-bandwidth access to all the memory.

As of 2016, these assumptions hold for almost all single-socket systems - which include the majority of personal desktop computers and low-end servers.

 

But due to the laws of physics and the speed of light, assumption #2 does not hold for larger systems. So now we're into the territory of Non-Uniform Memory Access (NUMA) and cluster/distributed computing. For now the focus will be on NUMA systems since that's what the majority of multi-socket systems are.

 

In short, y-cruncher does not run efficiently on NUMA. While it scales reasonably well onto dual-socket systems, it all goes downhill after that. There have been numerous reports of non-scaling (and even backwards scaling) on quad-Opteron systems. And all of this is due to NUMA.

 

To exacerbate the problem, OSes normally default to a "first-touch" policy for memory allocation. The problem is that y-cruncher allocates all its memory at the start of a computation with a single thread. With the first-touch policy, all the memory will be allocated on (or heavily biased towards) one NUMA node. So during the computation, all the cores from all the NUMA nodes will be hammering that one node for memory access. The interconnect going in and out of that one node will be overwhelmed by the traffic which leads to terrible performance.

 

y-cruncher v0.7.3 added the ability to interleave memory across nodes. This balances the interconnect traffic and leads to a significant performance improvement on modern dual and quad-socket systems. But ultimately, this doesn't actually solve the problem of NUMA since the memory accesses (which are now randomized) will still be hitting remote nodes over the interconnect.
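
On Linux, the interleaving itself can be done with libnuma. The snippet below is only a minimal illustration of the concept (y-cruncher has its own allocators and also runs on Windows):

    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <numa.h>       // libnuma - link with -lnuma

    int main() {
        if (numa_available() < 0) {
            std::puts("NUMA not available on this system");
            return 0;
        }
        const std::size_t bytes = std::size_t(1) << 30;    // 1 GiB

        // Pages are interleaved round-robin across all NUMA nodes, so no single
        // node's memory controller and interconnect take all of the traffic.
        void* buffer = numa_alloc_interleaved(bytes);
        if (!buffer) { std::puts("allocation failed"); return 1; }

        std::memset(buffer, 0, bytes);                     // touch the pages
        numa_free(buffer, bytes);
        return 0;
    }

Interleaving only balances the traffic though; true NUMA awareness - each node working mostly out of its own local pages - is the harder problem described next.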

 

To be efficient, the program needs to be aware of the NUMA and needs to be designed specifically for it. This is a fairly difficult task which is made worse by the unlimited combinations of NUMA topologies. Long story short, making y-cruncher NUMA aware will require changing the way that memory is fundamentally stored and managed. This means that it will need to be done as a new "mode" much like how "Ram Only" and "Swap Mode" have completely different data storage formats.

 

 

Currently, the most promising solution is to generalize the functionality of swap mode in the following ways:

  1. Instead of using a disk array, replace it with a "far memory" interface which can be anything from disk to interleaved NUMA or even cloud storage.
  2. Leverage the swap mode algorithms for minimizing disk access to minimize access to far memory.
  3. Duplicate shared resources into each NUMA node to reduce interconnect thrashing.

 


GCD Factorization:

Many binary splitting recursions have a lot of common factors between the fraction numerators and denominators. Implement an optimization that seeks to remove these common factors so that the numbers are smaller in size.

 

GCD Factorization was first described here (now a dead link). Since then, there have been subsequent publications that describe the method.

The general idea of the optimization is to keep the prime factorization of some of the recursion variables. Then use this factorization to obtain the GCD of the numerator and denominator terms. Obtaining this prime factorization is done using a sieve.
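
A sketch of the bookkeeping is below. It is illustrative only - real implementations like TachusPi use segmented sieves and much more compact representations - but it shows the key point: if a recursion variable is kept as a map from prime to exponent, the GCD of two such variables is just the element-wise minimum of the exponents. (Actually removing that common factor from the big integers is where the division problem in (3) below comes in.)

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <map>

    // Factored integer: prime -> exponent. Illustrative only; a real implementation
    // would sieve the index range and use a flat, compact representation.
    using Factored = std::map<uint64_t, uint32_t>;

    // GCD of two factored numbers = element-wise minimum of exponents.
    Factored gcd(const Factored& a, const Factored& b) {
        Factored g;
        for (const auto& [p, e] : a) {
            auto it = b.find(p);
            if (it != b.end()) g[p] = std::min(e, it->second);
        }
        return g;
    }

    // Divide out a known factor in factored form: just subtract exponents.
    void divide_out(Factored& a, const Factored& g) {
        for (const auto& [p, e] : g) {
            a[p] -= e;
            if (a[p] == 0) a.erase(p);
        }
    }

    int main() {
        Factored num = {{2, 10}, {3, 4}, {7, 1}};   // 2^10 * 3^4 * 7
        Factored den = {{2, 3},  {3, 6}, {5, 2}};   // 2^3  * 3^6 * 5^2

        Factored g = gcd(num, den);                 // 2^3 * 3^4
        divide_out(num, g);
        divide_out(den, g);

        for (const auto& [p, e] : num) std::cout << p << "^" << e << " ";
        std::cout << "/ ";
        for (const auto& [p, e] : den) std::cout << p << "^" << e << " ";
        std::cout << "\n";                          // 2^7 7^1 / 3^2 5^2
    }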

 

Current implementations of this optimization include GMP Pi Chudnovsky and TachusPi. And most literature reports a speedup of about 20 - 30% for Pi Chudnovsky.

 

In the context of y-cruncher, this optimization is applicable to the following constants/algorithms:

 

Current Problems:

 

Even though GCD factorization has been known for years, y-cruncher has yet to use it for a number of reasons:

  1. The naive approach to sieving the index terms will require a prohibitive amount of memory. This problem exists in GMP Pi Chudnovsky. Fabrice Bellard's TachusPi has a more efficient approach that does it using O( sqrt(N) ) memory - presumably using some sort of sieve segmentation. But he left few details on this approach.

  2. The data structure that is needed to perform the sieve and store the prime factorization is somewhat incompatible with y-cruncher's memory model.

  3. y-cruncher lacks the ability to efficiently perform integer divisions.

While (1) and (3) are certainly solvable, (2) is more difficult. All the ideas so far for attacking (2) are either complicated and fragile, or they require breaking out of the current memory model. For now, there's bigger fish to fry.