y-cruncher - Developments

(Last updated: January 16, 2017)

 

 


 

Active Branches:

At any time, there are usually two branches of y-cruncher that are actively maintained. The following table lists the changes that are confirmed for each active branch.

 

Most easy-to-fix bugs get patched and cherry-picked into all active branches. But an actual release is only done when the bugs involved collectively reach a sufficient level of severity. So the current version that is available for download might be (and quite often is) behind the release branch.

 

Publicly released versions of y-cruncher tend to lag behind the internal ones by several months or more. This is mostly for QC reasons since there is usually zero tolerance for bugs that affect the correctness of a computation. Version 0.7.1 is a rarity in that it was very close to trunk at the time of release.

All changes listed below are relative to v0.7.1.9466.

v0.7.1 (currently v0.7.1.9466): no changes listed.

 

v0.7.2 (currently v0.7.2.9467):

The internal refactorings from v0.7.1 continue into this release. These refactorings shouldn't cause any noticeable changes in behavior.

 

Fixes:

  • The Linux binaries will now properly read the CPU topology on multi-socket systems.

New Features:

  • In Windows, y-cruncher will automatically create a minidump file if it crashes. This should make it easier to debug crashes which cannot be reproduced locally in the development environment.

  • The BBP app now has the option to pin all threads to different cores. This solves the problem of imbalanced processor groups.

Optimizations:

  • Minor speedups for specific architectures.

Retunings:

  • The default "Min IO" parameter has been raised to 1 MB to keep in line with modern hard drives.
trunk (currently v0.7.3.9467-25): no changes listed.

 

Developments:

 

y-cruncher was pretty heavily developed between 2008 and 2013. But I've since left school and found a full-time job. While y-cruncher will probably remain a side-hobby for the near future, it won't get as much attention as it used to. At the very least, I'll continue to fix whatever bugs are discovered. And I'll make an effort to keep the program up-to-date with the latest instruction set extensions and development toolchains. But for the most part, the project is done.

 

Most of the work right now is in cleaning up the code and refactoring it into idiomatic C++ for long-term maintainability. The program was (and still largely is) a fragile mess of unreadable C hackery with little to no documentation. So at the very least, I'd like to get it into a state where someone other than myself can read it.

 

Nevertheless, it's a never-ending project. So there are things on the to-do list. But it can be a long time before anything gets done.

 

AVX512:

Intel announced AVX512 back in 2013. And they expanded it a year later.

 

Development on AVX512 for y-cruncher began in (a poorly kept) secret back in 2014. But without any indications of its performance, the work was mostly limited to "obvious" optimizations and high-level refactorings for the wider SIMD.

 

Once the consumer Skylake line was released along with details of the architecture, it became possible to (confidently) construct the performance models that are necessary to do the "non-obvious" AVX512 optimizations.

 

As of March 2016, there are 3 internal binaries with AVX512:

  • x64 AVX512-CD (Knights Landing)
  • x64 AVX512-DQ (Skylake Purley)
  • x64 AVX512-IFMA (Cannonlake)

All of these are working and tested through emulation on both Windows and Linux. All that's needed now is the actual hardware for final testing and performance tuning. So there's a lot of AVX512 code that's been sitting around while we wait for the hardware.

 

We were supposed to get this stuff in 2015, but that never happened. Assuming Knights Landing will be well beyond a consumer budget, 2017-2018 seems like a more likely timeframe before any of this sees the light of day.

 

The Knights Landing and Skylake binaries are not expected to change significantly once the hardware becomes available. Most of the AVX512 code in those binaries consists of trivial widenings of the existing AVX2 code. So they can be done confidently with relatively safe assumptions about the hardware.

Nevertheless, there are places that touch more uncertain features (such as gather/scatter and vpmullq). But these are few and far between.
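
Setting those aside, here is roughly what a "trivial widening" looks like (a minimal sketch for illustration, not y-cruncher's actual kernels): swap the 256-bit intrinsics for their 512-bit counterparts and double the loop stride.

    #include <immintrin.h>
    #include <cstddef>

    // Illustrative only: widening an AVX2 loop to AVX512F.
    // Assumes n is a multiple of the vector width; real code needs a scalar tail.
    void add_avx2(double* c, const double* a, const double* b, size_t n){
        for (size_t i = 0; i < n; i += 4){      // 4 doubles per 256-bit vector
            __m256d va = _mm256_loadu_pd(a + i);
            __m256d vb = _mm256_loadu_pd(b + i);
            _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb));
        }
    }

    void add_avx512(double* c, const double* a, const double* b, size_t n){
        for (size_t i = 0; i < n; i += 8){      // 8 doubles per 512-bit vector
            __m512d va = _mm512_loadu_pd(a + i);
            __m512d vb = _mm512_loadu_pd(b + i);
            _mm512_storeu_pd(c + i, _mm512_add_pd(va, vb));
        }
    }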

 

On the other hand, the Cannonlake binary is another story. The AVX512-IFMA instructions threaten to shake things up by opening an entirely new dimension to bignum arithmetic. As of now, the performance models are insufficient to explore this area in depth.
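
For context (an assumption about how the instructions would be used, not y-cruncher's actual kernels): the IFMA instructions multiply 52-bit operands and accumulate the low or high half of the 104-bit product, which is what makes a 52-bit-limb bignum representation attractive in the first place.

    #include <immintrin.h>

    // Illustrative only: one multiply-accumulate step on 8 independent 52-bit
    // limbs, one per 64-bit lane (vpmadd52luq / vpmadd52huq, requires AVX512-IFMA).
    void madd52(__m512i& acc_lo, __m512i& acc_hi, __m512i a, __m512i b){
        acc_lo = _mm512_madd52lo_epu64(acc_lo, a, b);   // acc_lo += low  52 bits of a*b
        acc_hi = _mm512_madd52hi_epu64(acc_hi, a, b);   // acc_hi += high 52 bits of a*b
    }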

Rewrite the Radix Conversion:

The current radix conversion code is actually a prototype that ended up in production. It has so many problems that it's basically unsalvageable.

  • It's not possible to do checkpoints.
  • It uses more memory than necessary.
  • It's completely inflexible to any modifications that affect the memory layout.
  • It's very fragile and very easy to break.

The current code is actually the 3rd radix conversion to be used in y-cruncher (and the 2nd to use the Scaled Remainder Tree). But it's still a prototype since it's the first to combine all of the following features/optimizations:

  • Scaled Remainder Tree
  • Middle Product
  • Removal of Trailing Zeros
  • Precomputation of the forward FFTs of the powers.
  • Parallelization with bounded memory usage.
  • Swap Mode (out-of-core) Computation

At the time (2011), this was very ambitious - perhaps a little too ambitious. In the end, it required so many hacks that the thing became a complete mess.
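
For reference, the underlying divide-and-conquer idea (without the Scaled Remainder Tree or any of the optimizations listed above) is simple. A rough sketch, using GMP's C++ interface purely for illustration:

    #include <gmpxx.h>
    #include <string>

    // Illustrative only: convert a non-negative integer to decimal by recursively
    // splitting around a power of 10. Assumes 0 <= x < 10^digits. None of the
    // optimizations listed above (scaled remainder tree, middle products,
    // cached FFTs of the powers, ...) are present.
    static std::string to_decimal(const mpz_class& x, size_t digits){
        if (digits <= 18){
            std::string s = x.get_str(10);
            return std::string(digits - s.size(), '0') + s;   // pad to 'digits' places
        }
        size_t half = digits / 2;
        mpz_class pow10;
        mpz_ui_pow_ui(pow10.get_mpz_t(), 10, half);           // pow10 = 10^half
        mpz_class hi = x / pow10;                             // top (digits - half) digits
        mpz_class lo = x % pow10;                             // bottom 'half' digits
        return to_decimal(hi, digits - half) + to_decimal(lo, half);
    }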

 

The radix conversion will need to be completely redesigned and rewritten.

Still trying to figure out an approach that will be maintainable without sacrificing any performance...

 

The existing radix conversion has virtually no internal abstraction. This makes it both very fast and unmaintainable.

Reduced Memory Mode:

For Pi Chudnovsky and Ramanujan, add a mode that will allow a computation to be done using less memory/disk at the cost of slower performance.

 

For a typical computation, most of the work requires very little memory. It's the occasional memory spike that causes y-cruncher to have such a high memory requirement.

 

There are about 4 large memory spikes in a Pi computation. In approximate descending order of size, they are:

  1. The final Binary Splitting merge in the series summation.

  2. The Final Multiply.

  3. The radix conversion verification.

  4. The first split of the radix conversion.

These spikes can be flattened via space-time tradeoffs in the respective algorithms. Since the trade-off only needs to be done at the spikes, the overall performance hit should be reasonably small.

As of v0.6.8, only the first two memory spikes have been suppressed. The overall memory reduction is not enough to be worth enabling the feature publicly.

 

The last two spikes both involve the radix conversion, so removing them is blocked pending the rewrite of the radix conversion code.

 

This partially completed feature was used for the 12.1 trillion digit computation of Pi.

 

 

Areas of Research:

 

Silent Data Corruption:

Disk I/O is the bane of large number computations. Historically, it was bad only because it was slow. Nowadays, it's also unreliable.

 

"Unreliability" comes in several forms:

  1. Basic I/O errors where an API call to read or write fails.
  2. The entire drive fails completely leading to permanent data loss.
  3. Silent data corruption that is neither detected by the hardware nor the operating system.

(1) and (2) are easily solved using checkpoint-restart and periodic backups. y-cruncher has supported checkpointing since v0.5.4, so these aren't really a problem anymore. But (3) is very scary and remains a huge problem today.

 

Silent data corruption is the worst since it's undetected. It plagues database applications, and unfortunately, y-cruncher is extremely vulnerable as well. If an error manages to slip through y-cruncher's built-in redundancy checks, it will cause a computation to finish with the wrong digits.

 

 

Analysis:

 

Hard drives already use error-correction and CRCs to ensure data integrity. And transfers between the drive and the controller are also protected with CRCs. So you would expect that data corruption would be detected by the operating system, right? Nope...

 

When a hard drive fails to read a sector due to CRC failures, it usually propagates all the way back into the program and manifests as a (1). So that's not a problem. But transfer errors between the drive and the controller are less ideal. Supposedly, transfer errors lead to retransmission. But this isn't always the case.

 

Throughout the years of developing y-cruncher, transfer errors have also been observed to cause:

  1. Severe slowdowns or hangs of the drive in question.
  2. Silent data corruption.
  3. Silent data corruption along with an increase in the S.M.A.R.T. "Ultra ATA CRC error count".

(1) is to be expected if the connection quality is so bad that the data gets stuck in a retransmission loop.

(2) is also to be expected if the corrupted data passes a CRC by chance. (I have no idea how long the CRC is, but if it's CRC32, a 1 in 4 billion chance per transfer isn't negligible at these data volumes.)

(3) makes absolutely no sense at all. If the hard drive was able to detect the error, then why the hell doesn't it notify the OS that the operation failed?

 

In any case, there are other things that don't add up. And in the end, we are forced to accept the reality.

 

To date, y-cruncher's only defense against silent data corruption is RAID 3. When the swap file configuration is set to use RAID 3, (almost) all reads are parity checked. If the data is bad, it will report a failure. The parity bits are stored inverted so that an error which zeroes out everything (data and parity alike) will still fail the parity check.
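
A minimal sketch of that idea (not the actual raid-file code): XOR parity over the data stripes, stored bit-inverted so that a corruption that zeroes the data and the parity alike still fails the check.

    #include <cstdint>
    #include <cstddef>

    // Illustrative only: byte-wise XOR parity over n_stripes data stripes.
    // The parity is stored inverted ("flipped") so that an all-zero read of
    // every stripe AND the parity is still detected as corruption.
    void build_parity(const uint8_t* const* stripes, size_t n_stripes,
                      size_t len, uint8_t* parity_out){
        for (size_t i = 0; i < len; i++){
            uint8_t p = 0;
            for (size_t s = 0; s < n_stripes; s++)
                p ^= stripes[s][i];
            parity_out[i] = static_cast<uint8_t>(~p);   // store inverted parity
        }
    }

    bool verify_parity(const uint8_t* const* stripes, size_t n_stripes,
                       size_t len, const uint8_t* parity){
        for (size_t i = 0; i < len; i++){
            uint8_t p = 0;
            for (size_t s = 0; s < n_stripes; s++)
                p ^= stripes[s][i];
            if (static_cast<uint8_t>(~p) != parity[i])
                return false;                           // report the read as bad
        }
        return true;
    }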

 

Unfortunately, this is far from sufficient.

To date, silent data corruption has yet to bring down a world-record attempt. But it happens regularly in the development test machines (which contain some very old hard drives). There has also been an instance reported on a forum where a 1 trillion digit computation of Pi failed - presumably due to silent data corruption.

 

Approximately 80% of the disk I/Os that y-cruncher does are covered by redundancy checks at a higher level. So in the absence of RAID 3, silent data corruption will be detected with 80% probability if we assume uniform distribution. To put it simply, 80% is not good enough. But at the very least, failing a redundancy check should be an immediate and serious red flag.

 

 

Possible Solutions:

 

Use a filesystem designed for data integrity (such as ZFS). While this is almost too obvious, there will be performance trade-offs.

 

The other approaches involve adding checksumming into y-cruncher's raid-file implementation. But this is easier said than done:

Inlining checksums into the data will break sector-alignment and incur copy-shifting overhead. But perhaps this can be merged with the RAID interleaving. Placing the checksums elsewhere will add seek overhead. Either way, it will be messy.
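
To make the trade-off concrete, here is a hypothetical layout sketch (the sizes and names are assumptions for illustration, not y-cruncher's actual on-disk format):

    #include <cstdint>

    // Option A: inline checksum. Each 4096-byte sector now holds only 4088
    // bytes of payload, so the program's contiguous data no longer maps 1:1
    // onto sectors and every read/write must copy-shift around the checksums.
    struct InlineChecksumBlock{
        uint8_t  payload[4096 - sizeof(uint64_t)];
        uint64_t checksum;          // e.g. a CRC of the payload
    };
    static_assert(sizeof(InlineChecksumBlock) == 4096, "block fits one sector");

    // Option B: out-of-line checksums. The payload stream stays packed and
    // sector-aligned, but the checksum table lives elsewhere on disk and
    // costs an extra seek to consult.
    struct ChecksumTable{
        uint64_t checksum[512];     // one entry per 4096-byte payload block
    };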

 

Due to the poor state of the current raid-file code, any significant change will likely imply a complete rewrite of the entire raid-file implementation.

 

 


Non-Uniform Memory Access (NUMA):

y-cruncher is currently a shared memory program. So it will run efficiently given the following assumptions:

  1. All the memory in the system is shared and can be accessed by all processor cores.
  2. All processor cores have low latency and high-bandwidth access to all the memory.

As of 2016, these assumptions hold for almost all single-socket systems - which include the majority of personal desktop computers and low-end servers.

 

But due to the laws of physics and the speed of light, assumption #2 does not hold for larger systems. So now we're into the territory of Non-Uniform Memory Access (NUMA) and cluster/distributed computing. For now, the focus will be on NUMA systems since that's what the majority of multi-socket systems are.

 

In short, y-cruncher does not run efficiently on NUMA. While it scales reasonably well to dual-socket systems, it all goes downhill after that. There have been numerous reports of non-scaling (and even backwards scaling) on quad-Opteron systems. And all of this is due to NUMA.

 

To exacerbate the problem, OSes normally default to a "first-touch" policy for memory allocation. The problem is that y-cruncher allocates all its memory at the start of a computation with a single thread. With the first-touch policy, all the memory will be allocated on (or heavily biased towards) one NUMA node. So during the computation, all the cores from all the NUMA nodes will be hammering that one node for memory access. The interconnect going in and out of that one node will be overwhelmed by the traffic, which leads to terrible performance.

 

This can be solved by telling either the OS or the BIOS to use an interleaved allocation policy that spreads out the memory across all the nodes. This balances the interconnect traffic and leads to a significant performance improvement on modern dual and quad-socket systems. But ultimately, this doesn't actually solve the problem of NUMA since the memory accesses (which are now randomized) will still be hitting remote nodes over the interconnect.
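
On Linux, the interleaving can also be requested programmatically, per allocation. A sketch using libnuma (this is an assumption about how one might do it, not necessarily how y-cruncher does it; running under "numactl --interleave=all" achieves the same effect without code changes):

    #include <numa.h>      // link with -lnuma
    #include <cstdlib>

    // Illustrative only: ask the kernel to interleave an allocation's pages
    // across all NUMA nodes instead of relying on the first-touch default.
    void* alloc_interleaved(size_t bytes){
        if (numa_available() < 0){
            return std::malloc(bytes);          // no NUMA support; plain allocation
        }
        return numa_alloc_interleaved(bytes);   // pages spread round-robin over nodes
    }

    // Matching free: numa_free(ptr, bytes) for the NUMA path, std::free(ptr) otherwise.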

 

To be efficient, the program needs to be NUMA-aware and designed specifically for it. This is a fairly difficult task which is made worse by the unlimited combinations of NUMA topologies. Long story short, making y-cruncher NUMA-aware will require changing the way that memory is fundamentally stored and managed. This means that it will need to be done as a new "mode", much like how "Ram Only" and "Swap Mode" have completely different data storage formats.

 

 

Currently, the most promising solution is to generalize the functionality of swap mode in the following ways:

  1. Instead of using a disk array, replace it with a "far memory" interface which can be anything from disk to interleaved NUMA or even cloud storage (a rough sketch of such an interface follows this list).
  2. Leverage the swap mode algorithms for minimizing disk access to minimize access to far memory.
  3. Duplicate shared resources into each NUMA node to reduce interconnect thrashing.
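
A hypothetical shape for such a "far memory" interface (the class and method names here are made up for illustration and are not an actual y-cruncher API):

    #include <cstddef>
    #include <cstdint>

    // Hypothetical "far memory" interface: the swap-mode algorithms would issue
    // bulk transfers against this instead of talking to a disk array directly.
    // Backends could be a raid-file, interleaved NUMA memory, or remote storage.
    class FarMemory{
    public:
        virtual ~FarMemory() = default;

        // Total capacity of the far region, in bytes.
        virtual uint64_t capacity() const = 0;

        // Bulk transfers between near (local) memory and the far region.
        virtual void read (void* near_buf, uint64_t far_offset, size_t bytes) = 0;
        virtual void write(const void* near_buf, uint64_t far_offset, size_t bytes) = 0;

        // Hint for the access-minimizing algorithms: cost of touching far memory
        // relative to near memory (disk >> remote NUMA node > local node).
        virtual double relative_latency() const = 0;
    };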

 


GCD Factorization:

Many binary splitting recursions have a lot of common factors between the fraction numerators and denominators. Implement an optimization that seeks to remove these common factors so that the numbers are smaller in size.

 

GCD Factorization was first described here (now a dead link). Since then, there have been subsequent publications that describe the method.

The general idea of the optimization is to keep the prime factorization of some of the recursion variables. Then use this factorization to obtain the GCD of the numerator and denominator terms. Obtaining this prime factorization is done using a sieve.
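
A toy sketch of that idea (a standalone illustration, not y-cruncher code): sieve the smallest prime factor up to N, factor the terms of interest against it, and take element-wise minimums of the exponents to get the GCD's factorization.

    #include <vector>
    #include <map>
    #include <algorithm>
    #include <cstdint>

    // Illustrative only: smallest-prime-factor sieve + factored GCD.
    // A real implementation would need a segmented sieve to keep the memory
    // under control, which is exactly problem (1) discussed below.
    static std::vector<uint32_t> sieve_spf(uint32_t n){
        std::vector<uint32_t> spf(n + 1, 0);
        for (uint32_t i = 2; i <= n; i++){
            if (spf[i] == 0)                              // i is prime
                for (uint64_t j = i; j <= n; j += i)
                    if (spf[j] == 0) spf[j] = i;
        }
        return spf;
    }

    // Factor x (2 <= x <= n) into {prime : exponent} using the sieve.
    static std::map<uint32_t, uint32_t> factor(uint32_t x, const std::vector<uint32_t>& spf){
        std::map<uint32_t, uint32_t> f;
        while (x > 1){ f[spf[x]]++; x /= spf[x]; }
        return f;
    }

    // GCD as the element-wise minimum of exponents - no big-integer division needed.
    static std::map<uint32_t, uint32_t> gcd_factored(const std::map<uint32_t, uint32_t>& a,
                                                     const std::map<uint32_t, uint32_t>& b){
        std::map<uint32_t, uint32_t> g;
        for (const auto& p : a){
            auto it = b.find(p.first);
            if (it != b.end())
                g[p.first] = std::min(p.second, it->second);
        }
        return g;
    }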

 

Current implementations of this optimization include GMP Pi Chudnovsky and TachusPi. And most literature reports a speedup of about 20 - 30% for Pi Chudnovsky.

 

In the context of y-cruncher, this optimization is applicable to several of the supported constants and algorithms.

 

Current Problems:

 

Even though GCD factorization has been known for years, y-cruncher has yet to use it for a number of reasons:

  1. The naive approach to sieving the index terms will require a prohibitive amount of memory. This problem exists in GMP Pi Chudnovsky. Fabrice Bellard's TachusPi has a more efficient approach that does it using O( sqrt(N) ) memory - presumably using some sort of sieve segmentation. But he left few details on this approach.

  2. The data structure that is needed to perform the sieve and store the prime factorization is somewhat incompatible with y-cruncher's memory model.

  3. y-cruncher lacks the ability to efficiently perform integer divisions.

While (1) and (3) are certainly solvable, (2) is more difficult. All the ideas so far for attacking (2) are either complicated and fragile, or they require breaking out of the current memory model. For now, there are bigger fish to fry.