y-cruncher - Version History

Release Date

Version

Changes

January 9, 2025

v0.8.6.9545

Windows + Linux

This version is not performance comparable to v0.8.5 due to new optimizations.

General New Features:

General performance changes due to a completely new training model.
Tuning Profiles: Each binary now contains various preset tuning profiles optimized for different CPUs.
The stress-tester will now display the elapsed time when an error occurs.
Benchmark Pi now has a 2nd multi-threaded option that disables SMT. (uses no more than 1 vcore per physical core)
Custom Compute now has an option to load defaults that disables SMT.
The refresh rate of status prints has been increased.
New and faster secondary formulas for Log(2) and Log(3) thanks to Jorge Zuniga.
Over a dozen new custom formulas have been added.

Parallel Framework Improvements:

The Task Queue and Push Pool frameworks now support core affinities.
The Task Queue and Push Pool frameworks now have better load balance for imbalanced processor groups. (untested)
The TBB parallel framework now supports limiting the # of threads.

Integration/Wrapper Improvements:

New command line options have been added to better control console printing. You can choose between WinAPI and Virtual Terminal escape sequences for how y-cruncher should do console manipulation such as colors and refreshing status lines.
Added a command line option to log all console output to a file. This is a better alternative to piping that will strip out all escape sequences.
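
To illustrate what stripping escape sequences from captured console output involves, here is a minimal Python sketch. It is not y-cruncher's implementation. The pattern and the function name are assumptions made for illustration, and only the common CSI sequences (colors, line erasing, cursor movement) are handled.

```python
import re

# Matches CSI sequences such as "\x1b[32m" (color) or "\x1b[2K" (erase line).
# Assumption: only CSI sequences matter here; this is not a full terminal parser.
CSI_PATTERN = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def strip_escape_sequences(text: str) -> str:
    """Return text with CSI escape sequences removed."""
    return CSI_PATTERN.sub("", text)


if __name__ == "__main__":
    raw = "\x1b[32mComputing...\x1b[0m 42%\x1b[2K"
    print(strip_escape_sequences(raw))
```

A plain pipe or redirect would keep these bytes in the output file, which is why a log without them is easier to read and parse.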

Other:

Removed FFTv2, FFTv3, SFTv2, and SFTv3 from the stress tester.

August 4, 2024

v0.8.5.9544

Windows + Linux

Fixes:

Added comma separators to BBP offset display.
Minor fix to improve Benchmate integration.

July 21, 2024

v0.8.5.9543

Windows + Linux

Fixed a critical bug in the AVX2 binaries that may cause the base conversion to fail.

July 20, 2024

v0.8.5.9542

Windows + Linux

Added "24-ZN5", a new binary optimized for AMD's Zen5 processors.

July 7, 2024

v0.8.5.9541

Windows + Linux

Fixes:

Fixed an issue in swap mode when using more than 2TB of memory.
Fixed an issue in reduced memory mode for Pi computations larger than 1 trillion digits.

Both bugs have been present since v0.8.2.

July 1, 2024

v0.8.5.9539

Windows + Linux

Zen5 optimizations were originally intended for this release. But due to delays in getting the final hardware, this will be pushed to a future update.

New Features:

The BBP digit extractor has now been made into a formal benchmark.
- The BBP program now supports command line options.
- The default BBP algorithm has been changed to Huvent's formula.
- BBP computations now produce validation files.
It remains to be seen if this benchmark will be added to HWBOT.

The BBP benchmark will be a CPU-only benchmark that is unaffected by memory bottlenecks. It will complement the existing Pi benchmarks which have become increasingly memory-bound.
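
To give a feel for why a BBP computation is CPU-bound rather than memory-bound, here is a minimal Python sketch of hexadecimal digit extraction. It uses the original Bailey-Borwein-Plouffe formula, not Huvent's formula, and it is not y-cruncher's implementation. The function names are illustrative, and because it relies on double-precision floats it should only be trusted for small offsets.

```python
def _fractional_series(j: int, n: int) -> float:
    """Fractional part of the sum over k of 16^(n-k) / (8k + j)."""
    total = 0.0
    # Terms with k <= n: modular exponentiation keeps every number small.
    for k in range(n + 1):
        denom = 8 * k + j
        total = (total + pow(16, n - k, denom) / denom) % 1.0
    # Terms with k > n shrink by more than 16x per step, so a few suffice.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0


def pi_hex_digit(n: int) -> str:
    """Hex digit of Pi at position n after the point (n = 0 is the first)."""
    x = (4 * _fractional_series(1, n)
         - 2 * _fractional_series(4, n)
         - _fractional_series(5, n)
         - _fractional_series(6, n)) % 1.0
    return "0123456789abcdef"[int(x * 16)]


if __name__ == "__main__":
    # Pi in hexadecimal begins 3.243f6a88...
    print("".join(pi_hex_digit(n) for n in range(8)))
```

The working state is a handful of scalars per series, so the cost is almost entirely arithmetic and nearly independent of memory bandwidth.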

Stress Tester Changes:

Added new tests SNT and SVT which are small in-cache versions of N63 and VT3.
Added new tests FFTv4 and SFTv4. This is a new floating-point FFT implementation.
FFT and SFT are the old floating-point FFT tests and remain for now. They will be removed in the future.

Other Changes:

Added a new floating-point FFT implementation. This improves performance by a few %.
Switched compilers from ICC to ICX. This hurts performance by a few %. (mostly offsetting with the above)
The BBP program is now slower because the ICX compiler is worse than the ICC compiler.
y-cruncher will now accept json for loading config files.
y-cruncher will now allow computations up to 500 trillion digits without developer authorization.

You may notice some visual changes where "Tuning Profiles" have been added to the program in various places. These are part of a future feature that did not make the cut for this release. So for now, they serve no purpose other than visual aesthetics.

Fixes:

For the BBP program, the 12-BD2 binary now correctly uses the FMA3 codepath instead of the SSE4.1 codepath.

March 5, 2024

v0.8.4.9538a

Windows

No functional changes. The "y-cruncher.exe" launcher binary has been recompiled with Visual Studio 17.7.1 which gives fewer virus scanner false positives.

February 22, 2024

v0.8.4.9538

Windows + Linux

Fixed a crash that can happen when using more than 64 threads.

February 21, 2024

v0.8.4.9537

Windows + Linux

With the exception of AMD's Bulldozer line of processors, this release is expected to have minor to negligible performance changes from v0.8.3. Benchmarks should still be comparable going all the way back to v0.8.2.

General Changes:

The limit of y-cruncher has been (tentatively) increased from 1 x 10¹⁵ to 108 x 10¹⁵ decimal digits. Computations above 200 trillion digits will still require developer authorization for now. The Euler-Mascheroni Constant has a lower limit of around 31 x 10¹⁵ digits due to the current implementation overflowing a 64-bit integer at a lower size.

The swap mode radix conversion has been completely rewritten and now supports checkpointing.

Replaced the "11-BD1" binary with "12-BD2" with the following changes:
- 12-BD2 uses FMA3 instead of FMA4.
- Optimizations for 1st gen AMD Bulldozer are therefore dropped.
- Change of compiler from MSVC to ICC for Windows.

Validation files now include more samples of digits.
Validation files now record statistics for hexadecimal digits even if they are disabled for output.

Math Changes:

Log(x) for both the built-in constant and the custom formula function will now special case for powers of 2, 3, and 5 using these new (and faster) formulas by Jorge Zuniga.

The BBP digit extractor now supports Huvent's BBP formula which is slightly faster than Bellard's.

Custom Formula Changes:

New Functions:

Divide(x, y) will now special case for x being a small integer. It is no longer necessary to use Power(x, -1) and LinearCombination() to get the same effect performance-wise.

ArcCoth(x) now supports non-integer inputs.

Power(x, y) now supports non-integer powers.

Invsqrt(x) of a large number and all dependent functions are slightly faster. (affects Sqrt(x), AGM(x), and Log(x))

All functions that internally require a constant (such as Pi, e, or log(2)) can optionally take them as parameters to avoid recomputing them. This deprecates Log-AGM since the ability to specify Pi and log(2) has been added to Log(x) itself.

As a result of these new functions, there are a lot of new formulas as well as modifications to existing ones.

Bug Fixes:

Fixed the Amdahl's Law + Zen4 Hazard debacle affecting swap mode computations on machines with many cores.
Carryout parallelization for the VT3 and N63 algorithms has been fixed and re-enabled.
Fixed a bug that may cause extremely large swap mode multiplications to pick suboptimal parameters.

February 13, 2024

v0.8.3.9533

Windows + Linux

Bug Fixes:

Fixed a serious bug that may cause multiplications larger than 29 trillion digits to fail.
Fixed an issue where a computation error in swap mode may crash the program before it can print out the error.
Fixed a bug in the 32-bit binaries that may cause large swap mode computations to fail. The root cause is an integer overflow of size_t.
Fixed HWBOT submitter not recognizing v0.8.3 for launching benchmarks.
The Linux binaries now have the correct version of libtbb.
Fixed config load/saving not working for the swap multiply tester.

December 7, 2023

v0.8.3.9532

Windows + Linux

Fixed an issue where the config files would not properly handle escape characters. This prevented the use of NTFS mount points for file paths to exceed the 26 drive letter limit.

This bug was introduced in v0.8.3 when the config file system was rewritten.

December 5, 2023

v0.8.3.9531

Windows + Linux

Bug Fixes:

Fixed an issue that prevented the program from working on Windows 7.
Fixed a potentially serious issue that may cause incorrect computation*.

*This is caused by a bug in the parallel carry propagation for the VT3 and N63 algorithms. While the issue has never appeared in actual computations or even internal regression tests, it was found while refactoring the relevant code for v0.8.4.

The fix for v0.8.3 is to disable the parallelism with a proper fix coming in v0.8.4. Performance impact should be negligible as it only becomes performance critical during pathological carryout which is extremely rare and does not happen for regular Pi benchmarks.

December 2, 2023

v0.8.3.9530

Windows + Linux

Part 3 of the ongoing rewrites.

As with v0.8.2, this release has major intrusive changes to swap mode. If you will be attempting a long running computation (such as a Pi record), it is strongly recommended to test things at scale before attempting the full computation. This version makes major changes to the software raid and the checkpoint restart - both of which are critical to world record attempts.

There are no changes in this release that are expected to affect competitive benchmarking or hardware reviews. The performance of in-memory Pi computations remains unchanged from v0.8.2 and thus benchmarks can be compared across versions.

Math Changes:

Added two new fastest formulas for Zeta(3) by Jorge Zuniga.
Added a new 2nd fastest formula for Catalan's constant by Jorge Zuniga.
Added two new formulas for Lemniscate by Jorge Zuniga.
Lots of new custom formulas! Zeta(5), Gamma(1/3), and Gamma(1/4) have new pairs of fastest formulas.
InvNthRoot for custom formulas is now faster if the input is a small integer.
Removed all old formulas for Zeta(3).
Removed all old formulas for Catalan's constant except for Pilehrood-short.
Removed the ArcSinlemn-based formulas for Lemniscate by Gauss and Sebah.
Removed ArcSinlemn() from the custom formulas.

Infrastructure Changes:

Removed the VST algorithm.
Binaries are now smaller due to the removal of nearly 250,000 lines of code. (consisting of the HNT, N32, N64, VST, and C17 algorithms)
The Disk Raid 0 framework for swap mode now has optimizations for SSDs.
The Disk Raid 0 framework now supports I/O parallelism - including between reads and writes. This can increase utilization of storage devices with independent read and write channels.
Removed the legacy RAID 0+3 framework.
On Windows, swap files are now read-protected to non-administrators to reduce the risk of leaking raw device data from SetFileValidData().
Swap mode now supports keeping multiple checkpoints in a zero-overhead manner. If a new checkpoint contains all the files from an older checkpoint, that older checkpoint will not be deleted yet. This means there can be multiple resume points should the program get interrupted. If the latest checkpoint doesn't work, you may be able to resume from an older checkpoint if it's available.
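
The retention rule for multiple checkpoints can be sketched as a subset test. This is only a hedged Python illustration of the rule as described above, not y-cruncher's code. The function name, the checkpoint names, and the representation of a checkpoint as a set of file names are all assumptions.

```python
from __future__ import annotations


def checkpoints_to_keep(older: dict[str, frozenset[str]],
                        newest_files: frozenset[str]) -> list[str]:
    """Older checkpoints that stay usable at no extra storage cost.

    Assumption: an older checkpoint survives only if every file it needs
    is also part of the newest checkpoint, so keeping it costs nothing.
    """
    return [name for name, files in older.items() if files <= newest_files]


if __name__ == "__main__":
    older = {
        "cp1": frozenset({"a", "b"}),  # all files still needed by the newest
        "cp2": frozenset({"a", "c"}),  # "c" is no longer kept
    }
    newest = frozenset({"a", "b", "d"})
    print(checkpoints_to_keep(older, newest))
```

In this toy case only "cp1" remains a valid resume point, because "cp2" depends on a file that the newest checkpoint no longer contains.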

The guide for swap mode has been updated to reflect the new changes.

Bug Fixes:

Untested fix for an issue in Linux where thread-affinities may not always work.
Fixed an issue where base conversion can fail for certain "short" numbers.

October 13, 2023

v0.8.2.9524

Windows + Linux

Fixed a serious bug in swap mode where a large multiplication with a lot of memory may fail.

September 21, 2023

v0.8.2.9523

Windows + Linux

Fixed detection of newer instruction sets like APX and AVX10.

This bug has no functional effect on the program since y-cruncher does not use these newer instruction sets yet.

September 4, 2023

v0.8.2.9522

Windows + Linux

Part 2 of the ongoing rewrites.

If you are planning a large computation like a Pi record attempt, it is strongly recommended to test this version at scale before attempting the actual run. The changes in this release are extremely invasive and have completely replaced the very core of the large sized disk multiplication.

Features:

Removed the 20-ZN3 ~ Yuzuki binary. Performance is poor due to inadequate hardware for tuning. (I need a Ryzen 9 5950X to do this.)
Removed the N64 and C17 algorithms. N64 has been superseded by its rewrite, the "N63". C17 has been superseded by VT3.
Added a new algorithm "N63" which is the ground-up rewrite of the 64-bit NTT (N64).
VT3 and N63 now support swap mode. Thus swap mode computations now use VT3 and N63 instead of VST, C17, and N64.
Swap mode now has a new framework that uses ram as far-memory. It's primarily for testing purposes, but can probably be used for more efficient computations on systems with HBM. (run the program in HBM, but use ram as disk/far-memory)
The swap multiply tester now supports config files.
The command line option for specifying swap mode paths has different implementations:
- "raid0" will use the Disk Raid 0 framework.
- "raid3" will use the Legacy Raid 0+3 framework.
- "ram" will use the Ram Drive framework.
In the past, it always used the Raid 0+3 framework which is inefficient and has been deprecated for a while.

This will be the last version that has the old VST algorithm. The overclocking community has shown VT3 to be the stronger stress test.

Fixes:

Fixed "skip-warnings" not working on launcher warnings.

In memory of my uncle Robbie whom I was extremely close to and was effectively my 3rd parent growing up. Rest in peace. You'll be missed dearly. I will drive your Tesla someday, though it might be a while.

July 11, 2023

v0.8.1.9517

Windows + Linux

Changes:

Removed the N32 and Hybrid NTT algorithms. Old binaries will slow down from this.
Added VT3, a new implementation of the VST algorithm. New binaries will be faster.
Cilk Plus is now available on all 64-bit Windows binaries. This is expected to persist even after it is removed from the Intel compiler.
The default tests in the stress tester have changed.
The stress tester now shows performance details.
The Windows Thread Pool is no longer the default parallel framework for Windows due to negative interactions with the VT3 algorithm.

Due to the large performance changes, support for HWBOT submission will be withheld until HWBOT decides what to do.


August 31, 2022

v0.7.10.9513

Windows + Linux

New Features:

Added 22-ZN4, a new binary optimized for AMD Zen4 with AVX512.

The 18-CNL and 22-ZN4 Linux binaries in this release have not been tested.

June 19, 2022

v0.7.9.9510

Windows + Linux

Fixes:

Fixed the SFT stress-test not being invokable from the command line.
Fixed config saving not working.
Computation limit has been increased to 200 trillion digits.
Spot-check files have been updated with the latest records.
Added a new custom formula for Catalan's constant by Jorge Zuniga.
20-ZN3 is no longer automatically selected due to unexpectedly poor tuning.

March 14, 2022

v0.7.9.9509

Windows + Linux

Changes:

Added 20-ZN3, a new binary tuned for AMD Zen 3. Depending on your system and configuration, it may or may not actually be faster than 18-ZN2. You will have to experiment.

The 18-CNL binary has now been retuned for Intel Tiger Lake. It will still run on Cannon Lake since it doesn't use any new instructions.

The menu and configurations for digit output have been revamped to allow for more output formats in the future. For now, the only new setting is that you can enable/disable the use of raw I/O for digit output.

Fixed an issue in an upcoming version of Windows where multiple processor groups can be assigned to the same NUMA node.

Removed the 00-x86 binary as it is well outdated and bloating the download sizes.

All binaries have been retuned to reduce memory footprint at the cost of more computation. Performance changes can be in either direction (better or worse).

Compilers and libraries have been fast-forwarded by 2 years since this project went on hiatus.

This release has only been lightly tested.

September 14, 2020

v0.7.8.9507

Windows + Linux

New Features:

The swap multiply tester now shows how many data passes are needed to perform the forward FFT.
The stress tester now allows a global time limit as opposed to running endlessly.

Fixes:

Fixed another BufferTooSmallException in Custom Formulas.

March 1, 2020

v0.7.8.9506

Windows + Linux

New Features:

19-ZN2: A new binary optimized for AMD Zen 2 processors.

Fixes:

Fixed a buffer overrun in the Linux versions with the numa-interleave allocators.
Fixed an issue in custom formulas that may cause a BufferTooSmallException.
Improved detection for HPET on Windows for HWBOT.

October 27, 2019

v0.7.8.9503

Windows + Linux

Swap Mode Changes:

The "Swap Configuration" has been replaced with "Far Memory Configuration". In the past, you had to use the RAID-0/3 implementation. Now, you can choose from multiple "far-memory" frameworks. In other words, you can now select far-memory frameworks like you can with the parallel frameworks and memory allocators. Currently, there are only 2 choices. But more are expected in the future.
- Legacy Disk RAID 0/3: This is the really old multi-level raid-file implementation that has been in use since v0.6.1.
- Disk RAID 0: This is new to v0.7.8. It is a dedicated RAID 0 implementation that is more efficient than the Legacy RAID 0/3.
There is now a "Far-Memory Tuning" menu that holds performance characteristics of the far-memory configuration. The "bytes/seek" parameter has been moved into this menu as a sub-parameter.

The legacy RAID 0/3 now has its own allocator settings for its I/O buffers. In the past, it used the same allocator as the one used for computational memory. This was decoupled as a part of the aforementioned far-memory refactor.

The new Disk RAID 0 framework allows per-path allocators. When combined with the NUMA-aware allocators, it becomes possible to place the I/O buffers on the NUMA nodes that are closest to their respective I/O devices.

Changes to Aid Large Computations:

The Disk RAID 0 framework supports error-detection checksums that should be able to detect most forms of silent I/O data corruption. This is enabled by default even though it has a small to moderate performance penalty.
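
As a rough idea of how an error-detection checksum catches silent I/O corruption, here is a minimal Python sketch. The checksum algorithm and block layout that y-cruncher uses are not stated here, so CRC32 and the "payload plus 4-byte tag" layout below are purely assumptions, as are the function names.

```python
import zlib


def seal_block(payload: bytes) -> bytes:
    """Append a CRC32 tag to the payload before it is written out."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def open_block(stored: bytes) -> bytes:
    """Verify the CRC32 tag of a block read back from storage."""
    payload, tag = stored[:-4], stored[-4:]
    if zlib.crc32(payload).to_bytes(4, "little") != tag:
        raise OSError("checksum mismatch: possible silent data corruption")
    return payload


if __name__ == "__main__":
    block = seal_block(b"314159265358979")
    print(open_block(block))  # round-trips cleanly
    corrupted = bytes([block[0] ^ 1]) + block[1:]  # flip a single bit
    try:
        open_block(corrupted)
    except OSError as err:
        print(err)
```

Computing and verifying a tag for every block is extra work on each read and write, which is consistent with the small to moderate performance penalty mentioned above.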

You can now tell y-cruncher to run a system command after each checkpoint. This command is a blocking operation that will halt execution until it is finished. This makes it easier and safer to perform automated backups of the checkpoints.
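
A backup command for this hook could be as simple as the Python sketch below. Everything in it is hypothetical: the script, its arguments, and the folder naming are assumptions, and how the command is registered with y-cruncher is not shown. Because the command blocks the computation until it finishes, the files should not change while they are being copied.

```python
import shutil
import sys
import time
from pathlib import Path


def backup_checkpoint(swap_dir: Path, backup_root: Path) -> Path:
    """Copy swap_dir into a new timestamped folder under backup_root."""
    target = backup_root / time.strftime("checkpoint-%Y%m%d-%H%M%S")
    # copytree fails if the target already exists, so a backup is never
    # silently overwritten.
    shutil.copytree(swap_dir, target)
    return target


if __name__ == "__main__":
    # Hypothetical usage: python backup.py <swap_dir> <backup_root>
    if len(sys.argv) == 3:
        print(backup_checkpoint(Path(sys.argv[1]), Path(sys.argv[2])))
    else:
        print("usage: backup.py <swap_dir> <backup_root>")
```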

The Custom Compute menu now gives an upper-bound for how much space is needed to store the largest checkpoint. This lets you properly budget storage for backups.

BBP Digit Extractor Revamp:

The BBP Digit Extractor has been revamped for the first time since v0.6.8. Previously, the digit extractor had an offset limit of 2⁴⁶. This meant that it couldn't verify Pi computations larger than 84.7 trillion decimal digits. The new version increases this limit significantly.

y-cruncher Version	Offset Limit (Original Formula)	Offset Limit (Bellard's Formula)
< v0.7.8	2⁴⁶	2⁴⁶
v0.7.8 (no AVX2)	~2⁴⁸	~2⁴⁹
v0.7.8 (AVX2 or later)	~2⁵⁸	~2⁵⁹

The new limit of 2⁵⁹ will likely be large enough for many years to come. But reaching these higher offsets will come at a performance cost as it is significantly slower to sum up terms with larger divisors. As a result, performance will be less linear and less predictable than before.

As part of this revamp, the BBP Digit Extractor is now faster than before for the sizes that were reachable in past versions.
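
The 84.7 trillion figure can be reproduced with a short calculation, assuming the offset limit counts hexadecimal digits: each hexadecimal digit carries log10(16) ≈ 1.204 decimal digits. The Python sketch below applies the same conversion to the new limits. The function name is illustrative, and the v0.7.8 values are approximate because the limits themselves are.

```python
import math


def hex_offset_to_decimal_digits(hex_digits: int) -> float:
    """Decimal digits covered by a given number of hexadecimal digits."""
    return hex_digits * math.log10(16)


if __name__ == "__main__":
    # Exponents taken from the offset limit table (Original, Bellard).
    limits = [
        ("< v0.7.8", 46, 46),
        ("v0.7.8 (no AVX2)", 48, 49),
        ("v0.7.8 (AVX2 or later)", 58, 59),
    ]
    for version, original_exp, bellard_exp in limits:
        original = hex_offset_to_decimal_digits(2 ** original_exp)
        bellard = hex_offset_to_decimal_digits(2 ** bellard_exp)
        print(f"{version}: {original:.3e} / {bellard:.3e} decimal digits")
```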

Misc. New Features:

New Formula for Catalan's Constant: Jesús Guillera discovered another fast formula for Catalan's Constant. It is the 2nd fastest known formula for the constant and has been implemented natively in y-cruncher.

Reduced Memory Mode for Pi: Two additional "algorithms" for Pi have been added that run the Chudnovsky and Ramanujan formulas using less memory/storage at the cost of speed. The current memory reduction is about 5 - 15%. Since these are optimized for memory rather than speed, future versions of y-cruncher may improve the memory reduction at even more cost to speed. This feature has existed since 2013 and is finally being enabled in a public release, though some older releases of y-cruncher already had it enabled as a hidden option.

Added a new parallel framework based on a simple task queue. This framework wasn't intended to be performant for computation. But it has been added to the framework list anyway.
Some of the parallel frameworks now have an option to better control the recursive fan-out behavior when dispatching a large number of tasks.
When resuming a computation, the resume date will be printed on the screen.
The exponent limit for the shift and power functions has been increased to the full 64-bit range.
Support for radicals has been added to the custom formulas.
Thread Building Blocks (TBB) is now supported in the dynamically linked Linux binaries.
All the non-AVX Windows binaries will now be able to utilize more than one processor group.

Optimizations:

The 18-CNL binary is now properly tuned for the Core i3 8121U. Though it will probably get re-tuned for Ice Lake in the future.
Most of the newer binaries have been retuned.
Some of the parallel frameworks have been re-tuned.
The power function in custom formulas is now faster for small inputs.

Removed Features:

Support for Windows Vista has been dropped. All Windows binaries now require Windows 7 or later.
The 16-KNL binary has been removed. y-cruncher never ran well on Xeon Phi and Intel has officially discontinued the entire processor line.
The 11-BD1 binary no longer uses XOP instructions. This allows it to be run on Zen 1 and Zen+ processors. The performance impact on Bulldozer processors should be minimal. Since XOP is no longer supported on any modern processors, it has become increasingly difficult and inconvenient to test this binary.

Other:

Lots of minor UI and configuration changes due to internal refactorings.
Performance fluctuations for the Windows AVX and AVX512 binaries due to compiler changes.

April 22, 2019

v0.7.7.9501

Windows + Linux

Fixes:

P5 - Fixed an issue where errors while creating, opening, or renaming a swap file won't print out the internal error message.
P3 - Fixed an incorrect entry in the digit spot check file for Log(10).
P3 - Fixed a possible bug in the memory calculations for certain binaries.

March 27, 2019

v0.7.7.9500

Windows + Linux

Fixes:

P2 - Fixed an issue where Zeta(3) - Wedeniwski would fail to resume from a checkpoint.
P5 - Improved error message in Linux when the disk is full.
P3 - Fixed some issues in the custom formulas.

March 10, 2019

v0.7.7.9499

Windows + Linux

Fixes:

P5 - Removed an outdated message about a million digits being the minimum.
P3 - Fixed an issue with the RAID3 after marking a drive as failed.
P4 - Fixed an issue where the program cannot immediately recover from a drive failure in RAID3 while renaming checkpoint files.
P5 - Fixed some incorrect error messages for certain file IO errors.
P4 - Fixed a crash in the I/O benchmark when using a large task decomposition for some situations.
P3 - Fixed an issue in the BBP digit extractor with a very large number of threads.
P5 - Fixed some issues with loading invalid configuration files.
P3 - Fixed a corner-case bug for some Newton's Method iterations.
P5 - Fixed a minor bug with invalid input handling for command line arguments.
P3 - Fixed a crash in Swap Mode for some custom formulas.
P5 - The year for the Chudnovsky formula has been changed to 1988. The original year (1989) was the wrong year.
P5 - Fixed a typo in Ramanujan's formula for Pi.
P4 - Fixed an issue where some swap mode computations over-estimate their required disk usage.
P3 - Fixed an issue where some custom formula functions use more disk than they are supposed to in Swap Mode.

鲍东方 found all but the last 5 on this list.

Other:

Added support for a new syntax for the custom formula Scope function that preserves ordering through JSON serialization.

February 24, 2019

v0.7.7.9498

Windows + Linux

Fixes:

P2 - Fixed a regression in v0.7.7.9497 where one of the many bugfixes triggered a compiler optimizer bug in VS2015 that would sometimes cause the relevant binaries to incorrectly calculate memory sizes. This is significant because it had a small potential to produce a computation plan that could not be finished. Bugs like this are normally caught during the extensive QA testing phase prior to a feature release. But these tests are not rerun for patch releases.
P4 - Fixed a crash in some low memory situations involving the swap files.
P4 - Fixed some issues with the RAID-3 functionality.
P5 - When the program crashes on multiple threads simultaneously, the error and mini-dump messages are no longer overlapping.
P5 - Fixed multiple minor UI bugs.
P5 - The time limit in the stress-tester has been capped to 1 million seconds.
P3 - Fixed an alignment issue in some internal functions.
P5 - Fixed a couple minor UI bugs in the core selector menu of the stress tester.
P5 - Fixed a bug in the I/O benchmark when the bytes/seek is greater than the working memory.
P5 - Fixed a minor UI bug in the Digit Viewer.
P5 - Fixed an issue with running a stress-test from a configuration file with invalid core numbers.
P5 - Fixed an issue where custom formula labels weren't properly being sanitized.

Thanks to 鲍东方 for finding every bug in this list except for the P2 regression and one of the minor UI bugs.

February 16, 2019

v0.7.7.9497

Windows + Linux

Fixes:

P3 - Fixed an issue where a short number can error the program. This is only possible to hit from the custom formulas.
P5 - Fixed an out-of-date warning for extremely slow series.
P5 - Fixed a bug in the Euler-Mascheroni Constant that may cause it to display an incorrect value for the disk usage.
P5 - The length limit for custom formula strings now counts Unicode code points instead of UTF-8 code-units.
P4 - When running a swap mode computation from the command line, the "bytes/seek" parameter wouldn't auto-update when the swap configuration is set.
P5 - Fixed an issue where certain invalid UTF-8 sequences will cause y-cruncher to hang.
P3 - Fixed some more BufferTooSmallException errors in the custom formulas.
P3 - Fixed a spurious error in Newton Method implementations that can only be hit with custom formulas.
P3 - Fixed an issue with SeriesBinaryBBP for large power coefficients.
P4 - Fixed a bug in the RAID-3 quick configure.
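
The difference between counting Unicode code points and counting UTF-8 code units, mentioned in the length-limit fix above, can be seen in a few lines of Python. This is only an illustration of the two counts. The function names are made up, and y-cruncher's actual limit is not shown.

```python
def code_points(text: str) -> int:
    """Number of Unicode code points in the string."""
    return len(text)


def utf8_code_units(text: str) -> int:
    """Number of bytes (code units) in the UTF-8 encoding of the string."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    label = "π≈3"  # 2-byte, 3-byte, and 1-byte characters in UTF-8
    print(code_points(label), utf8_code_units(label))
```

A limit expressed in code units penalizes non-ASCII labels, since the same visible length consumes more of the budget.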

Huge thanks to 鲍东方 for systematically trying to break the program in every way possible. He found all but one of the bugs on this list! Most of these are years old bugs that are now being exposed by the custom formulas.

February 2, 2019

v0.7.7.9496

Windows + Linux

Fixes:

P5 - Removed some extraneous warnings for unusual series and extremely large computations.
P4 - Fixed a crash in the I/O benchmark when using a very small amount of memory.
P3 - Fixed a bug in the custom formulas that can cause a BufferTooSmallException.

Once again, 鲍东方 has reported two of these.

January 23, 2019

v0.7.7.9495

Windows + Linux

Fixes:

P4 - Partial work-around for system error code (1453) which may happen while writing digits at the end of a large computation on systems with very little memory.
P5 - Improved error message for error code 1453 to hint at possible low memory situation.
P4 - The Gamma(1/4) digit file should now properly decompress on non-English locales.
P5 - Errors when loading custom formula files are now handled more consistently.
P5 - Error messages from simultaneous swap file errors will not overlap as badly as before.

Thanks to 鲍东方 for reporting several of these.

January 10, 2019

v0.7.7.9494

Windows + Linux

Fixes:

P4 - Series polynomials of degree larger than 62 are now disallowed. Polynomials of degree 63 or higher will hang the program.
P3 - Fixed an issue where certain classes of polynomials would cause y-cruncher to error or enter an infinite loop.

Thanks to 鲍东方 for reporting both of these!

January 5, 2019

v0.7.7.9493

Windows + Linux

New Features:

Custom Formulas: y-cruncher can now run a limited set of user-specified formulas. This allows the user to compute more than just the built-in constants. It also allows the user to implement alternative formulas for the existing constants. Sample formulas have been included in the y-cruncher downloads.

Catalan's Constant Algorithms: 3 new algorithms have been added for Catalan's Constant. Due to the speed of these new algorithms, the existing Lupas and Huvent formulas are now outdated. But they will remain in y-cruncher for the time being.

Lemniscate with AGM: The AGM algorithm has been implemented for Lemniscate. This algorithm is about 2x faster than Gauss' formula in memory, but may be slower in Swap Mode.

Brent-McMillan Formula with explicit n: This allows you to run the Brent-McMillan formula for the Euler-Mascheroni Constant with a manually specified n parameter. This is somewhat of a useless feature that has existed internally since 2013.

Digit and Validation Output Path Improvements:
- The "-o" command line option will now be used for both the digit output as well as the validation file.
- A new "-od" option has been added to suppress digit output. This is the new method for disabling output.
- Added a command line option to set the priority which y-cruncher runs at.
- When output verification is enabled, there is now an additional verification step to detect data corruption of the source data during digit output that may go undetected in previous versions.

The TBB parallel framework now has sub-options to tinker with.

There is now a command line option to override the process priority that y-cruncher will run at.

Removed Features:

ArcCoth(x) has been removed. It has become superfluous as it remains accessible from the Custom Formula feature.

Re-Tunings:

AMD Jaguar processors will now choose the 11-SNB binary instead of the 08-NHM binary.
The TBB parallel framework now uses slightly different library calls which may affect performance.

Other:

The file extension for configuration files has been changed to ".cfg".
The naming of some of the algorithms has been changed.
Minor speed differences due to numerous internal refactorings.
Non-fatal errors that occur during memory allocation are now logged to the event log.
More run-time warnings will now propagate to the client in Slave Mode.
The format for the PauseWarning message in Slave Mode has been updated.

December 12, 2018

v0.7.6.9491

Linux

Fixes:

P3 - Fixed an issue in the Linux versions where old checkpoint files wouldn't get deleted.

December 2, 2018

v0.7.6.9490

Windows + Linux

Fixes:

P2 - Fixed a serious bug in the Digit Viewer that may cause y-cruncher to incorrectly compress digits.
P4 - The stress tester will no longer crash when it fails to allocate memory.
P3 - Fixed some cases in Swap Mode where it will under-estimate how much storage is actually needed.

November 19, 2018

v0.7.6.9489

Windows

Fixes:

P3 - Fixed a critical 32-bit integer overflow bug in the 32-bit Windows binaries. This was introduced by the 9488 patch and was caught mere hours after it was released.

v0.7.6.9488

Windows + Linux

Fixes:

P3 - Untested fix for the NUMA binding issue on 4-die Threadripper that may cause performance to drop by up to 50%.
P3 - Fixed some issues with really small swap mode computations.
P5 - Fixed some minor UI bugs in the node-interleaving frameworks for Linux.

October 14, 2018

v0.7.6.9487

Windows + Linux

Fixes:

P3 - Fixed an issue that may cause very small swap mode computations with a high task decomposition to fail.
P4 - Fixed a crash when saving a configuration in Linux with the libnuma Node-Interleaving allocator.

September 16, 2018

v0.7.6.9486

Windows + Linux

Fixes:

P4 - Fixed a bug that causes y-cruncher to crash when saving a configuration file with the Cilk Plus framework.
P4 - Fixed a bug that causes the program to pause when running the Euler-Mascheroni Constant in swap mode.

September 6, 2018

v0.7.6.9485

Windows + Linux

Ugh... As expected, this is turning into one of the buggier releases...

Fixes:

P1 - Fixed a bug in the Euler-Mascheroni Constant in swap mode where it may fail to resume from a checkpoint.
P4 - Fixed the "-TD" option for Pi benchmarks.

August 28, 2018

v0.7.6.9484

Windows + Linux

Fixes:

P3 - Fixed a bug in the Euler-Mascheroni Constant that may lead to a BufferTooSmallException.

New Features:

Added a "pause:-2" option that will also suppress pauses even in the event of an error.

August 11, 2018

v0.7.6.9483

Windows + Linux

New Features:

18-CNL: A new binary for Cannon Lake processors that uses the AVX512 IFMA and VBMI instructions. This binary remains untuned and not fully tested. It will be updated in the future along with a proper name when the hardware becomes more readily available.

Digit Output Improvements:
- Digit output can now be suppressed. This allows benchmarks to be scripted without hammering the storage device with writes.
- When hexadecimal digits are enabled, the validation file will include the last 50 - 100 hexadecimal digits the same way that it has always included the last 50 - 100 decimal digits.
- When hexadecimal digits are enabled, the computation will include a "hex hash". This pairs with the existing "dec hash".
  - Dec Hash = Floor(x * 10^dec) mod (2^61 - 1)
  - Hex Hash = Floor(x * 16^hex) mod (2^61 - 1)
- The Digit Viewer has been revamped and completely rewritten:
  - It is now parallelized and better optimized.
  - Digit hashing and frequencies are now done automatically with compress/decompress streaming.
  - Digit reading is now done using raw (unbuffered) disk I/O. This is needed to avoid the Pagefile Thrash of Death on Windows.
  - More details here: https://github.com/Mysticial/DigitViewer
- All computations will compute an additional 100 digits beyond the requested precision. These digits are not output anywhere and are kept secret from the user. Instead, the SHA256 hash of these digits, taken as a 100-byte ASCII string, will be written to the validation file.
  
  The purpose of this is to help verify subsequent world records. When a world record is set with this feature, there will be a SHA256 hash for the next 100 digits. When the record is broken again (with any application, not just y-cruncher), the person can prove it by producing the digits that hash to this SHA256 hash.
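
As a rough illustration of the hash formulas and the hidden-digit scheme described above, the following Python sketch computes the decimal or hexadecimal hash from a digit string and checks a claimed run of 100 digits against a published SHA256 value. The function names and the input format (plain ASCII digit strings) are assumptions made for illustration; they are not y-cruncher's actual interfaces.

```python
import hashlib

MERSENNE_61 = 2**61 - 1  # modulus used by the dec/hex hash formulas above


def digit_hash(integer_part: str, fraction_digits: str, base: int = 10) -> int:
    """Floor(x * base^n) mod (2^61 - 1), where n = len(fraction_digits)."""
    h = 0
    # Horner's rule: consuming digits left to right builds floor(x * base^n),
    # reduced modulo 2^61 - 1 at every step so the value stays small.
    for ch in integer_part + fraction_digits:
        h = (h * base + int(ch, base)) % MERSENNE_61
    return h


def hidden_digits_hash(next_100_digits: str) -> str:
    """SHA256 of the 100 hidden digits, treated as a 100-byte ASCII string."""
    if len(next_100_digits) != 100:
        raise ValueError("expected exactly 100 digits")
    return hashlib.sha256(next_100_digits.encode("ascii")).hexdigest()


def proves_record(claimed_digits: str, published_sha256: str) -> bool:
    """A later record holder supports the claim by reproducing the hash."""
    return hidden_digits_hash(claimed_digits) == published_sha256.lower()
```

Because only the hash is published, the hidden digits cannot be recovered from the validation file, yet anyone who later computes them can demonstrate it by matching the hash.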

Slave Mode: Preliminary support has been added for Slave Mode. This allows 3rd party applications to control y-cruncher over TCP. The main motivation is to allow for a 3rd party GUI as there are no plans to do that within y-cruncher. More details here.

Note that this feature is still a work in progress. Therefore it is preliminary, incomplete, and not well documented yet. The current implementation is not expected to be fully usable and will require user feedback to mature.

Fixes:

P3 - Fixed an integer overflow bug that may cause 32-bit computations of e to fail when more than 2³² terms are needed.
P2 - The Pagefile Thrash of Death that may happen at the end of a computation has been fixed by the rewrite of the Digit Viewer.
P3 - Fixed an issue that may cause Swap Mode to fail on Linux when O_DIRECT isn't supported. Details here.
P4 - Fixed an issue that may cause the I/O Benchmark to fail with a "BufferTooSmallException".
P3 - Fixed a bug with the Euler-Mascheroni Constant that may cause the program to hang and consume lots of memory. This is caused by a pathological failure of the series summing logic due to the non-convergent behavior of the refinement term in the Brent-McMillan Formula.
P4 - Fixed a bug with the C malloc() allocator that may cause a "Buffer is misaligned." error in swap mode.

Optimizations:

Series summation is now more aggressively parallelized to improve CPU utilization on systems with many cores.
Slightly faster division and inverse square root.

Other:

Minor performance fluctuations across all processors due to internal refactorings and tuning parameter updates.
Minor UI changes due to internal refactorings.
Startup command line options are now documented in "Command Lines.txt".

Version 0.7.6 lands at the end of a very large refactor/rewrite of nearly all of y-cruncher's series summation code.

Due to the scope of this refactor along with the fragility of the original code, there is non-zero probability that something will break in this version. Please submit a bug report if you observe errors or large increases in memory usage from older versions of y-cruncher.

February 23, 2018

v0.7.5.9481

Windows + Linux

Fixes:

P3 - Fixed a bug that caused computations of e in ram only mode to compute e - 2 instead of e. This was due to some recent bad refactoring. It was not caught during testing since the spot checker only checks a subset of the digits.

January 21, 2018

v0.7.5.9480

Windows + Linux

New Features:

Preliminary support has been added for Threading Building Blocks (TBB) for all 64-bit Windows binaries. This was prompted by the Cilk Plus deprecation. Support for Linux may come at a later time.

When a soft error is encountered in the stress-tester, it will display the logical core # that it occurred on. This will aid the user in tracking down "weak" cores during overclocking.

A new verification step has been added to complement the base conversion verification. This step reads back the decimal digits that were output to disk to make sure they are intact. This protects against several classes of errors that were previously unchecked:
1. Errors in formatting the digits. (i.e. converting from binary to ASCII)
2. Errors during the compression of the digits.
3. Errors in writing the digits to disk.
Like the base convert verification, this digit output verification is not part of the computation. Therefore, it is also excluded from the official benchmark times. This also maintains benchmark comparability with older versions of y-cruncher.

Both base convert and decimal digit verification are now classified as "Output Verification". They are disabled by default for casual benchmarks and enabled by default in the Custom Compute menu. Enabling them will be mandatory when claiming world record size computations.
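
The idea behind the digit output verification can be sketched as a write followed by a read-back comparison. This is a minimal Python sketch under stated assumptions: the function name is hypothetical, the digits are held as a single ASCII string, and a SHA256 comparison stands in for whatever check y-cruncher actually performs.

```python
import hashlib
import os


def write_and_verify(digits: str, path: str) -> bool:
    """Write digits to disk, read them back, and compare with the original."""
    data = digits.encode("ascii")
    expected = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    return actual == expected
```

Note that a read-back issued right after a write may be served from the operating system's cache rather than from the storage device, so a sketch like this catches formatting and write-path errors more reliably than it catches media errors.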

Fixes:

P2 - Fixed a bug that may cause the base conversion to fail with an "Overlap Mismatch" error.

Optimizations:

Improved CPU utilization on systems with a lot of cores.
More optimizations that benefit memory-bound systems.
More AVX512 optimizations.

Other:

Minor performance fluctuations across all processors due to internal refactorings.

December 2, 2017

v0.7.4.9478

Windows + Linux

Fixes:

P2 - Fixed a serious bug that could cause very large computations to fail a redundancy check.

The source of the bug is a refactoring that occurred between v0.7.3 and v0.7.4. Under very specific circumstances, it may cause a computational thread to overrun its designated scratch buffer and corrupt the memory of other threads.

The bug was missed by unit tests. And since it only reproduces on extremely large computations, it was also missed by the large scale integration tests. Thanks to Susumu Tsukamoto for catching this.

October 14, 2017

v0.7.4.9477

Windows + Linux

New Features:

Stress-Tester Revamp:
- New tests: BBP and SFT. These two tests are designed to stay in cache and avoid the memory bottleneck that plagues the Skylake X processor line.
- Full control has been added to choose which logical cores to test and which to leave idle.
- On Linux, thread pinning is now more robust to unusual hardware configurations like disabled cores.
- Tests now display how much memory is actually used.
- Tests now show a slider indicating whether it is CPU-bound or memory-bound.
- The stress-tester now has support for configuration files.

Fixes:

Fixed a bug that caused excessive memory usage with large task decompositions. It is now possible to do 25 billion digit Pi benchmarks within 128GB of memory with a lot of threads.
Fixed a subset of the BufferTooSmallException errors that occur in swap mode computations using a very small amount of memory. These are caused by incorrect pre-computation memory calculations. A more comprehensive fix is slated for the next feature release.

Retunings:

The "00-x86", "04-P4P", and "05-A64 ~ Kasumi" binaries have been re-tuned for AMD Phenom II.
"16-KNL" and "17-SKX" have been retuned for smaller caches.
Tuning parameters for most of the other binaries have been updated.

Optimizations:

Bulldozer: ~6%
Haswell: ~4%
Skylake: ~5%
Ryzen: ~6%
Skylake X: ~10%

September 14, 2017

v0.7.3.9475

Windows + Linux

Fixes:

Fixed a bug that may cause ram only computations of e to fail with "Memory Allocation Failed".

August 15, 2017

v0.7.3.9474

Windows + Linux

Fixes:

Fixed a bug that prevented the I/O benchmark from being run from a config file via command line.
Fixed a bug in the stress-tester that may cause improper thread binding.
Errors during stress-testing will be printed out immediately.
Speculative fix for bugged NUMA detection on AMD Threadripper and Epyc on Linux.

July 12, 2017

v0.7.3.9472

Windows

Fixes:

Fixed a bug in Windows involving processor groups that prevents the stress-tester and the Push Pool framework from setting thread affinities on AMD Epyc systems with exactly 128 vcores. (Thanks to Dave Graham for reporting this.)

July 6, 2017

v0.7.3.9471

Windows + Linux

New Features:

17-SKX ~ Kotori: A new binary tuned for Skylake Purley processors with full-throughput AVX512.

16-KNL: An untuned AVX512 binary that will run on Knights Landing Xeon Phi host processors.

Configuration Files:
- The Custom Compute and I/O Benchmark menus now have support for saving to and loading from configuration files.
- The functionalities for these menus can also be triggered directly from the command line with a configuration file.

Memory Allocator Improvements:
- New memory allocators that support node interleaving. This may improve performance on NUMA systems.
- Allocators now have sub-options that can be selected.
- The memory allocator can now be set with the command line options.
- On Windows, it is now possible to lock pages without using large pages.

Fixes:

Improved detection of available memory in Linux. In the past, it only detected unused pages. Now it detects all available memory, including pages used for caching that can be released if needed.

Invalid command-line options will now terminate the program instead of being silently skipped.

Fixed a bug in swap mode that causes the primary formula for the Euler-Mascheroni Constant to require much more memory than is necessary. This bug was introduced when fixing the "working memory is too small" bug in v0.7.2.

Fixed an issue in the Custom Compute swap mode menu that may cause it to fail an internal assertion. This is also related to the Euler-Mascheroni Constant memory calculations.
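
The difference between "unused" and "available" memory on Linux is visible in /proc/meminfo, which reports MemFree (unused pages) separately from MemAvailable (an estimate that also counts reclaimable cache). The sketch below parses text in that format. Whether y-cruncher reads this file or uses some other mechanism is not stated here, so this is only an illustration, and the sample numbers are made up.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: size in bytes}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024  # the kernel reports these fields in KiB
        fields[name.strip()] = value
    return fields


# Made-up sample values, for illustration only.
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:         1024000 kB
MemAvailable:    9216000 kB
Cached:          8000000 kB
"""

info = parse_meminfo(SAMPLE)
unused = info["MemFree"]
# Fall back to MemFree on kernels that do not report MemAvailable.
available = info.get("MemAvailable", unused)
```

On a machine with a large page cache, sizing a computation from the unused figure alone would leave most of the usable memory on the table.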

Other:

The dynamically-linked Linux binaries now have a dependency on libnuma due to the new node-interleaving allocators.
The concept of local and global "Min I/O Size" has been renamed and replaced with "Bytes per Seek".

The new features in this release may seem a bit loaded in terms of their magnitude given that it's only been 4 months since the last major release. But in reality, most of the stuff has been a work-in-progress for a long time.

June 3, 2017

v0.7.2.9469

Windows + Linux

Fixes:

Fixed a serious bug in the Push Pool that could cause failures when limiting the # of threads. This bug was introduced in one of the many refactors between v0.7.1 and v0.7.2.

March 14, 2017

v0.7.2.9468

Windows + Linux

The refactorings from v0.7.1 continue into this release. Other than that, there are few user-visible changes.

Due to the time-constraints with trying to make this happen for Pi Day, this release is not as well tested as it should've been. So don't be surprised if stuff breaks.

Fixes:

The Linux binaries will now properly read the CPU topology on multi-socket systems.
Fixed an integer overflow bug in I/O benchmark for the x86 binaries.
Fixed a corner case where computations of the Euler-Mascheroni Constant may fail with "Working memory is too small."

New Features:

17-ZD1 ~ Yukina: A new binary tuned for AMD Zen processors.

In Windows, y-cruncher will automatically create a minidump file if it crashes. This should make it easier to debug crashes which cannot be reproduced locally in the development environment.

The BBP app now has the option to pin all threads to different cores. This solves the problem of imbalanced processor groups.

Optimizations:

Minor speedups for most processors:
- Penryn: ~6%
- Nehalem: ~7%
- Sandy Bridge: ~5%
- Bulldozer: ~7%
- Haswell: ~7%
- Skylake: ~5%
- Zen: ~14%

Retunings:

The default "Min IO" parameter has been raised to 1 MB to keep in line with modern hard drives.

Other:

The binaries have been renamed. Instead of using the name of an instruction set, they now use a year and an acronym for the processor architecture. This was done because there are simply too many instruction sets and newer ones don't necessarily imply the existence of all the older ones.

September 16, 2016

v0.7.1.9466

Windows + Linux

Fixes:

Fixed a bug in swap mode that causes objects to under-allocate. This sometimes leads to assertion failures about writing beyond the end of a swap file. This bug dates all the way back to the "risky" refactorings from v0.6.6. This took almost 2 years to catch because it only seems to affect the Euler-Mascheroni Constant.

For what it's worth, v0.7.1 has been surprisingly stable given the sheer magnitude of the internal changes.

May 16, 2016

v0.7.1.9465

Windows + Linux

This is the first version since v0.6.1 that is dedicated primarily to paying back technical debt. While there are no major functional changes, expect to see a lot of minor differences from v0.6.9 which are remnants of the internal refactorings.

New Features:

Spot Checking: For computations of major constants, the digits are automatically spot-checked against a table of known digits. This was originally an internal feature meant to streamline QC testing. But it turned out to be useful enough to enable publicly. This feature completely replaces the ad hoc digit-checker used for Pi benchmarks.

x64 ADX ~ Kurumi: A new binary tuned for desktop Skylake processors. It utilizes the new add-with-carry instructions. This binary will also run on Broadwell processors.

The x86 binary (no SSE) is back. It disappeared for all of v0.6.x for performance reasons, but it's back for v0.7.1. The binary won't be able to run on any old processors anyway since Windows Vista or later is a requirement. But it's there for the purpose of comparing the effects of the various instruction sets.

Disk I/O buffer sizes are now configurable. In the past, they were locked to 64MB/path. This resulted in suboptimal performance when the path(s) were themselves a RAID of multiple drives.

When entering the number of digits, you can now specify suffixes. (e.g. "25m", "10b")

y-cruncher will now attempt to use large pages and lock them in memory to prevent destructive disk swapping by the OS.
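
The digit-count suffixes mentioned above ("25m", "10b") could be parsed along the following lines. This is a hypothetical sketch, not y-cruncher's parser: it assumes "m" means million and "b" means billion, and it only handles those two suffixes because they are the only ones shown in the examples.

```python
SUFFIXES = {"m": 10**6, "b": 10**9}  # assumed meanings: million, billion


def parse_digit_count(text: str) -> int:
    """Turn input such as '25m' or '10b' into a number of digits."""
    s = text.strip().lower()
    if not s:
        raise ValueError("empty digit count")
    multiplier = 1
    if s[-1] in SUFFIXES:
        multiplier = SUFFIXES[s[-1]]
        s = s[:-1]
    # Reject anything that is not a plain run of ASCII digits.
    if not (s.isascii() and s.isdigit()):
        raise ValueError(f"invalid digit count: {text!r}")
    return int(s) * multiplier
```

Under these assumptions, "25m" yields 25,000,000 and "10b" yields 10,000,000,000, while input without a suffix is taken literally.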

HWBOT Integration Improvements:

Improved detection of the processor topology.
Added detection of the operating system version.
Added detection for the motherboard and memory configuration.
Added detection of the reference clock.
The validation files have been renamed so that they don't overwrite each other anymore.

Fixes:

Previously, y-cruncher would fail to pick up the username when running the binaries directly. This has been fixed.

Unicode is now properly handled everywhere in the program. Note that the ability to display Unicode characters is still subject to the limitations of the console window.

Fixed a bug where checkpoint-restart would fail if any path had an equals character (=) in it.

The stress-tester and BBP app will now be able to use multiple processor groups on Windows.
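
A failure like the equals-character bug above typically arises when a "key=value" line is split on every "=" rather than only the first one. The checkpoint file format is not described here, so the line layout and the key name in this Python sketch are hypothetical; it only illustrates the general pitfall.

```python
def parse_naive(line: str):
    """Breaks when the value itself contains '=' (too many pieces to unpack)."""
    key, value = line.split("=")
    return key, value


def parse_robust(line: str):
    """Split on the first '=' only, so paths may contain the character."""
    key, sep, value = line.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in {line!r}")
    return key.strip(), value.strip()
```

With a line such as `SwapPath=/mnt/disk=1/swap`, the naive version raises a ValueError while the robust version returns the full path intact.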

Optimizations:

Global retunings for all processor-specific binaries. Expect both speedups and slowdowns across the board for all computations and on all processor targets. This is most noticeable on older processors.

A custom thread pool has been added to the parallel frameworks. This is the default for Linux. On Windows, it is the default when there is only one processor group.

Swap mode computations are slightly more aggressive with keeping things in memory.

The initial memory allocation is now parallelized.

Other:

y-cruncher no longer requires administrator privileges to run basic computations. Instead, it will request that it be re-run with elevation for the following features:
- Swap mode in Windows: Requires "SeManageVolumePrivilege".
- Large pages in Windows: Requires "SeLockMemoryPrivilege".
- Locked pages in Linux: Requires "CAP_IPC_LOCK".

In Windows, when you run y-cruncher by double-clicking, it will set the console window dimensions to 80 x 25. Windows has historically defaulted to a window size of 80 x 25. But in Windows 10, they increased this to 120 x 30 - which looks a bit weird.

The Custom Compute menu options have been rearranged a bit.

The licensing has been reworded to explicitly allow the use of y-cruncher for tech reviews even if they are for commercial purposes.

March 1, 2016

v0.6.9.9464

Windows + Linux

Fixes:

Checkpoint restart will now properly save and restore the parallel framework. Previously, it would always load the default framework upon resuming a computation from a checkpoint. This bug has existed since v0.6.8 when parallel frameworks were added.

The framework threads will also be saved and restored across checkpoints.

Fixed a bug where the program may improperly detect the number of logical processors on Windows Server 2012 R2. This may also apply to other versions of Windows as well. (Thanks to Mike A for reporting this and suggesting a fix.)

Fixed a bug in the x86 binaries that would prevent the I/O benchmark from using more than 4GB of disk.

Fixed a bug in the Digit Viewer where it fails to print new lines when counting digit frequencies.

Optimizations:

The threshold for defaulting to Cilk Plus on Windows has been increased from 33 to 65 logical cores.

December 5, 2015

v0.6.9.9462

Windows + Linux

New Features:

On Windows, partial support has been added for Processor Groups*:
- The AVX, XOP, and AVX2 binaries are now able to detect all the logical cores in the system even if there are more than 64.
- The AVX and AVX2 binaries will default to Cilk Plus when there is more than one processor group.
Support for Cilk Plus has been added for Linux.
The maximum task decomposition has been increased from 256 to 65,536.

*This feature has not been added to the older SSE binaries because it requires Windows 7. As of v0.6.8, y-cruncher maintains backwards compatibility with Windows Vista. But all the AVX binaries require Windows 7 SP1 anyway, so nothing is lost by using Win7-specific API calls.

Fixes:

The app will no longer crash when AVX is disabled on a processor that supports it. This was a Visual Studio bug that was fixed by upgrading to Visual Studio 2015.
The command line option, "-C:-1" for compressing to a single file has been fixed. This was due to the (incorrect) use of an unsigned integer parser which parses -1 as zero thereby disabling compression.
The dispatcher will select AVX2 instead of SSE3 for AMD Zen.
Swap mode computations may cause excessive disk swapping by the OS. This has been alleviated by reducing the default memory usage from 31/32 of available memory to 15/16.
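
The "-C:-1" fix above is an instance of a common pitfall: a digits-only parser stops at the minus sign and reports zero. The Python sketch below reproduces that behavior with a hypothetical parser (not y-cruncher's code) and shows a signed variant alongside it.

```python
def parse_unsigned(text: str) -> int:
    """Accumulate leading decimal digits; stop at the first non-digit."""
    value = 0
    for ch in text:
        if not ("0" <= ch <= "9"):
            break
        value = value * 10 + (ord(ch) - ord("0"))
    return value


def parse_signed(text: str) -> int:
    """Same idea, but honor an optional leading minus sign."""
    if text.startswith("-"):
        return -parse_unsigned(text[1:])
    return parse_unsigned(text)
```

Here `parse_unsigned("-1")` returns 0, which a caller would read as "compression disabled", whereas `parse_signed("-1")` returns -1 as intended.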

Changes:

The Linux version is now available in both static and dynamic binaries. The static binaries are the most portable, but they lack Cilk Plus. The dynamic binaries support Cilk Plus, but have a dependency on Glibc-2.19 (and possibly others). This unfortunate DLL hell is because Intel refuses to provide Cilk Plus as a static library.
The ".out" extension for the Linux binaries has been removed. This was some very old legacy stuff that had something to do with GCC outputting "a.out" by default.
On Linux, Cilk Plus is the default for everything. This will improve the performance for small computations and on systems with many cores. But it may also cause minor performance regressions under some situations.

Optimizations:

On Windows, the binaries that support Cilk Plus will default to Cilk Plus when there are more than 32 logical cores.
Minor speedups for specific architectures.

Compiler Upgrades:

Windows: Visual Studio 2013 -> 2015
Linux: GCC 4.8 -> 5.1

May 7, 2015

v0.6.8.9461

Windows + Linux

Fixes:

Fixed a bug in the x86 binaries that was causing excessive memory usage.
Fixed a performance regression from v0.6.7 of up to 10% on older processors. This was the result of a bad refactor that accidentally replaced the size of the LLC (last level cache) with that of the L1 cache. Oops...
Fixed a bug in the BBP app where the compile-time CPU-dispatcher caused the "x64 SSE4.1 ~ Ushio" binary to use the "default" path instead of the SSE4.1 path.

March 17, 2015

v0.6.8.9460

Windows + Linux

Fixes:

Fixed a performance regression from v0.6.7 that could slow down Haswell processors by as much as 5%.
Fixed a bug in the x86 binaries that would cause stack corruption when calculating the memory requirements for a computation larger than ~500 billion digits. This bug has been there since v0.6.1.

March 14, 2015

v0.6.8.9458

Windows + Linux

New Features:

The BBP side-project from 5 years ago has been revived, rewritten, and integrated into y-cruncher.
The Digit Viewer has also been re-integrated into y-cruncher.
The multi-threading framework has been revamped. Now you can choose from several parallel computing frameworks. This feature is mostly experimental for now as the options vary by binary and are quite limited on Linux.

Fixes:

Fixed a rare but potentially serious bug in the basecase multiplication where a carryout may be missed. This was introduced in v0.6.6 when y-cruncher started using add-with-carry intrinsics. This only affected the 64-bit Windows binaries. The Linux versions are unaffected since GCC lacked these intrinsics.
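
For readers unfamiliar with the term, a carry-out is the overflow bit produced when adding fixed-width words, and it must be fed into the next word. The details of the actual bug are not given here; the Python sketch below only models multi-word addition with 64-bit limbs to show why dropping a carry silently corrupts the result. The function names are illustrative.

```python
LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1


def add_with_carry(a: int, b: int, carry_in: int):
    """Add two 64-bit limbs plus a carry; return (sum limb, carry out)."""
    total = a + b + carry_in
    return total & LIMB_MASK, total >> LIMB_BITS


def add_limbs(x, y):
    """Add two equal-length little-endian limb arrays.

    Returns (result limbs, final carry out).
    """
    result, carry = [], 0
    for a, b in zip(x, y):
        limb, carry = add_with_carry(a, b, carry)
        result.append(limb)
    return result, carry  # ignoring this final carry gives a wrong sum
```

Adding 1 to a number whose limbs are all ones ripples a carry through every limb and out the top, which is exactly the kind of rare input where a missed carry-out shows up.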

Limits:

The theoretical limit of y-cruncher has been increased to 10¹⁵ decimal digits for all constants.

Optimizations:

The Windows versions now use thread pools by default.
Other miscellaneous optimizations mostly affecting AMD processors.

Changes:

The concept of "threads" has been replaced with "Task Decomposition" and "Parallel Framework". In the past, a computation was run using N threads. Now it is run using framework X with a task decomposition of Y. Both the framework and the task decomposition can be manually set.
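
The split between "task decomposition" and "parallel framework" can be pictured with a small Python sketch. It is an analogy under stated assumptions, not y-cruncher's design: a standard-library thread pool stands in for the framework, and summing a range stands in for the real work.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(bounds):
    lo, hi = bounds
    return sum(range(lo, hi))


def decompose(n: int, tasks: int):
    """Split [0, n) into `tasks` nearly equal half-open ranges."""
    step, extra = divmod(n, tasks)
    ranges, lo = [], 0
    for i in range(tasks):
        hi = lo + step + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def run(n: int, task_decomposition: int, workers: int) -> int:
    # The "framework" (here a thread pool) decides how tasks map to threads;
    # the task decomposition only decides how finely the work is split.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, decompose(n, task_decomposition)))
```

The result is the same for any combination of `task_decomposition` and `workers`; only the scheduling changes, which is why the two settings can be tuned independently.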

February 8, 2015

v0.6.7.9457

Windows + Linux

Swap Mode Improvements:

Swap mode computations will now create a folder for the swap files. When multiple paths are used, each folder will have a unique name. This makes manual backups easier.
A swap mode multiplication tester has been added to the advanced options. Anyone who is attempting a world record with y-cruncher should first run this tester to sanity check the program's ability to do arithmetic at the target precision.
The I/O Benchmark has been revamped:
- The old benchmark for strided access was vulnerable to request coalescing which cannot happen in a real computation. This was producing inflated bandwidth numbers. This has been somewhat mitigated in this version.
- The recommendations have been updated to be more relevant for newer hardware.*

*Based on the runtime stats of the 12.1 and 13.3 trillion digit Pi computations, the new recommendation is to have an IO/compute ratio of 2.0. (i.e. the disk bandwidth should be double the compute bandwidth.) But the reality is that this will be very difficult to achieve on a modern high-end processor using conventional hard drives. This is the unfortunate result of the ever increasing performance gap between CPU and disk.

For example, a stock i7 5960X would require around 6 GB/s of disk bandwidth. At 100 MB/s per hard drive, that would be 60+ drives assuming linear scaling. While it is easier to do it using SSDs, they are smaller in size. Furthermore, the sheer volume of writes that y-cruncher will issue means that if you want to use SSDs, you need to be willing to expend them like consumables.
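
The arithmetic behind this estimate can be written out as a short Python sketch. The 3 GB/s compute bandwidth is not stated in the text; it is inferred here from the 6 GB/s disk figure and the 2.0 ratio. The function names are illustrative, and linear scaling across drives is assumed, as in the example above.

```python
import math


def required_disk_bandwidth(compute_gb_per_s: float,
                            io_compute_ratio: float = 2.0) -> float:
    """Disk bandwidth (GB/s) needed for a given compute bandwidth."""
    return compute_gb_per_s * io_compute_ratio


def drives_needed(disk_gb_per_s: float, per_drive_mb_per_s: float) -> int:
    """Number of drives, assuming linear scaling and 1 GB/s = 1000 MB/s."""
    return math.ceil(disk_gb_per_s * 1000 / per_drive_mb_per_s)


disk = required_disk_bandwidth(3.0)   # inferred compute bandwidth of 3 GB/s
drives = drives_needed(disk, 100.0)   # 100 MB/s per hard drive
```

With these inputs the sketch gives 6 GB/s and 60 drives, matching the "60+" figure once real-world scaling losses are taken into account.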

Fixes:

Fixed a serious bug that would cause large multiplications to hit an assertion failure under the right conditions.
In Linux, swap files are now created with only the necessary permissions. (read+write for owner)
In the past, they were created with permissions 777 (everything). This was a huge oversight dating back to 2010 when the program was first ported to Linux.
Fixed a problem where swap mode in Linux wouldn't clean up the "pathcheck.ysf" file.
Fixed a minor console coloring problem.
Fixed a minor bug in swap mode that could cause suboptimal algorithm selection.
Fixed a bug in the multi-layer raid-file implementation that may cause the program to crash. This bug has existed since v0.6.1 and is extremely rare - requiring degenerate input data which may be impossible outside of a developer build.
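
The swap file permission fix above amounts to passing an owner-only mode when the file is created. A minimal Python sketch for POSIX systems follows; the function names are hypothetical, and y-cruncher itself is not written in Python.

```python
import os
import stat


def create_swap_file(path: str) -> int:
    """Create a new file with read+write for the owner only (mode 0600).

    Returns an open file descriptor. O_EXCL makes the call fail if the
    file already exists instead of reusing it with its old permissions.
    """
    flags = os.O_CREAT | os.O_EXCL | os.O_RDWR
    return os.open(path, flags, 0o600)


def permission_bits(path: str) -> int:
    """Return the permission bits of a file, e.g. 0o600."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

The process umask can only remove bits from the requested mode, so group and other users never gain access to a file created this way.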

Optimizations:

The XOP and AVX2 binaries are now faster.
The default "Min-IO Size" parameter has been reduced from 1MB to 256k.

Other:

On Windows, the stress tester now runs in the lowest possible priority to ensure responsiveness of CPU monitoring programs which also run in low priority.
On Windows, the default process priority has been reverted to "Below Average" for the same reason. As always, this priority can be changed via Task Manager.
The minimum value for the "Min-IO Size" parameter has been reduced from 256k to 4k. SSDs as well as some hard drive configurations have very low average seek latencies which can benefit from a smaller Min-IO parameter.

December 21, 2014

v0.6.6.9452

Windows + Linux

Fixes:

Fixed a problem where the "Min I/O" parameter would be incorrectly set when resuming from a computation using more than one drive. This bug has performance consequences for large swap mode computations.
Fixed an issue that could prevent the program from detecting errors when performing swap mode multiplications. Anything that could allow errors to go undetected is serious since it means that a computation could finish with the wrong digits.

November 28, 2014

v0.6.6.9451

Linux

Fixed a problem reported by Matt Hesse where the Linux SSE3 binary would instantly fail on older Intel processors.

The problem is that the SSE3 binary was (incorrectly) compiled with "-march=barcelona" when it should have been "-mtune=barcelona". The "-march" setting enables AMD-specific Advanced Bit Manipulation (ABM) instructions which did not exist on Intel processors prior to Haswell.

November 19, 2014

v0.6.6.9450

Windows + Linux

Fixed a bad regression in swap mode where disk I/O errors would not stop the computation. Instead, they would be ignored and the computation would continue with corrupted data.

Blame this on bad refactoring and insufficient corner-case testing...

November 5, 2014

v0.6.6.9449

Windows + Linux

In memory of my grandfather who passed away last month. He loved numbers and is probably why I do too...

New Features:

Command Line Options
A "Stop on Error" option in the stress-tester to halt when an error is detected.

Critical Fixes:

Fixed an issue that can cause failures when using a non-power-of-two number of threads.
Fixed an issue that will cause most computations larger than 20 trillion digits to fail. This bug has existed since v0.6.1 and is caused by an off-by-one error in the creation of the precomputed twiddle factor tables.
Fixed an issue in the 32-bit versions that would cause Pi computations larger than 40 billion digits to fail the base convert verification. This was caused by an integer overflow in the "size_t" datatype when a 64-bit file offset should have been used instead. This bug has existed since v0.6.1 and was never caught because 32-bit y-cruncher has never been tested at such large sizes until now. (v0.6.1 does not support swap mode for Pi, so v0.6.2 is actually the earliest version that is affected.)
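
The "size_t" overflow described above can be modeled in Python by masking a file offset to 32 bits. This is only a sketch of the failure mode; the helper names are hypothetical and the offsets are example values.

```python
SIZE_T_32_MASK = 0xFFFFFFFF          # a 32-bit size_t holds values mod 2^32
UINT64_MASK = 0xFFFFFFFFFFFFFFFF     # a 64-bit file offset


def offset_as_size_t32(offset: int) -> int:
    """What a byte offset becomes when stored in a 32-bit size_t."""
    return offset & SIZE_T_32_MASK


def offset_as_uint64(offset: int) -> int:
    """The same offset stored in a 64-bit integer."""
    return offset & UINT64_MASK


five_gib = 5 * 2**30
wrapped = offset_as_size_t32(five_gib)   # silently wraps around
intact = offset_as_uint64(five_gib)      # preserved
```

An offset of 5 GiB wraps to 1 GiB in the 32-bit case, so the program would read or write the wrong part of the file without any error being raised.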

Minor Fixes:

Fixed some crashes that may occur when using a very large number of threads.
Fixed an issue when attempting to print a line that is longer than 79 characters.
Fixed an issue in the I/O Benchmark that may cause parity failures under RAID 3.
Fixed another issue with detecting OS support for AVX.

Optimizations:

Swap mode computations now require ~10% less disk space than before.
Minor overall speedups.

Retunings:

In Windows, the default process priority has been changed to "Normal". It used to be "Below Average".
Threads that perform disk I/O now run at the maximum priority that the OS will allow.
The program has been retuned in various ways. So expect some performance differences (both up and down) from the previous version.
The SSE3 binaries have been retuned in favor of larger computations. So they are faster for large computations, but slower for small ones. (similar to the SSE4.1 and AVX binaries in v0.6.3)

Internal Changes:

The compilers have been upgraded:
- Intel Compiler (13 -> 14)
- Visual Studio 2013 (Original -> Update 2)

August 23, 2014

v0.6.5.9444b
(fix 2)

Windows + Linux

Fixed a bug in the base conversion that may cause a computation in swap mode to enter an infinite loop.

(Credit to Yifang Sun for discovering this.)

This bug has existed since v0.6.1 and only occurs when the number of threads is small (1 or 2). The cause of the bug is a subtle design flaw in the handling of misaligned data sizes. The fix was to align the data and delete the flawed misalignment code.

July 24, 2014

v0.6.5.9443b
(fix 1)

Windows + Linux

Fixed a serious bug in the swap file implementation that may cause incorrect disk I/O leading to data corruption and ultimately an incorrect computation.

This bug only affects swap mode and is most likely to affect the Euler-Mascheroni Constant's primary algorithm.

The only change in this version is a single if-statement. Nevertheless, all bugs that affect the correctness of the computed digits are considered "serious" bugs that warrant an immediate patch and an unscheduled release.

May 25, 2014

v0.6.5.9442

Windows + Linux

New Features:

x64 AVX2 ~ Airi: A new binary tuned for Intel Haswell processors. It uses AVX2, FMA3, and BMI2 instructions.
The CPU dispatcher has been added to the Linux version.
Attempting to run a binary that is not compatible with the host will give a warning. An incompatible binary may still be able to run. But don't count on it.

Fixes:

Fixed the detection of AVX support by the OS. Previous versions only checked the OS version. It did not check if XSAVE was actually enabled to run AVX.
Units for memory have been changed to MiB/GiB/TiB etc... They have always been in binary, but the old labels were unclear as to whether it was binary or decimal.
The Linux version will attempt to reset the console colors to their defaults when the program exits.
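
The AVX detection fix above corresponds to the standard x86 check: the CPU must report AVX, the OS must have enabled XSAVE, and the XCR0 register must show that both SSE and AVX state are saved. Python cannot execute CPUID or XGETBV directly, so this sketch takes the raw register values as inputs; the function name is illustrative and this is not y-cruncher's dispatcher code.

```python
CPUID1_ECX_OSXSAVE = 1 << 27  # OS has enabled XSAVE/XGETBV
CPUID1_ECX_AVX = 1 << 28      # CPU supports AVX
XCR0_SSE = 1 << 1             # OS saves XMM register state
XCR0_AVX = 1 << 2             # OS saves upper YMM register state


def avx_usable(cpuid1_ecx: int, xcr0: int) -> bool:
    """Decide whether AVX may be used, given CPUID.1:ECX and XCR0."""
    if not (cpuid1_ecx & CPUID1_ECX_AVX):
        return False
    if not (cpuid1_ecx & CPUID1_ECX_OSXSAVE):
        return False  # XGETBV is unavailable, so XCR0 cannot be read
    needed = XCR0_SSE | XCR0_AVX
    return (xcr0 & needed) == needed
```

Checking only the OS version, as earlier versions did, misses the case where the hardware and OS are new enough but AVX state saving has been turned off.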

Removed Features:

x64 SSE4.1 ~ Nagisa: This binary was tuned for my Harpertown workstation (Core 2 Penryn). But Core 2 is pretty old now and this binary has become redundant with the "x64 SSE4.1 ~ Ushio" binary. So not much is lost by dropping it. Getting rid of it also reduces the size of the download.

Internal Changes:

The entire CPU dispatching framework has been completely rewritten from scratch.
The underlying tool-chain has been completely revamped for this release. The AVX and AVX2 binaries use the Intel Compiler 13. The rest of them use Visual Studio 2013.
No optimizations have been done since the last release. But because of the compiler upgrades, there will be slight differences in performance since the previous version.

March 14, 2014

v0.6.4.9424

Windows + Linux

New Features:

x64 XOP ~ Miyu:
- A new binary tuned for AMD Bulldozer line processors. It uses FMA4 and XOP instructions.
- Up to 10% faster than the SSE3 binary. (usually about 8% faster for large computations)

Fixes:

Fixed a bug in the inverse square root code that may cause the routine to under-estimate the amount of memory that is needed. The result is memory corruption when the computation uses more than is allocated.

February 21, 2014

v0.6.3.9416b
(fix 1)

Windows + Linux

Bug fixes. One of which was bad enough to warrant patching v0.6.3 instead of waiting for v0.6.4.

Fixes:

Fixed a display issue where the secondary formula for Log(n) would show up as the "Primary Machin-Like Formula".
Fixed an issue in the Stress Tester where certain classes of soft-errors would not get reported.
Fixed a serious bug in the SSE3 binaries that would sometimes cause very large multiplications to fail.*

*This bug was introduced in v0.6.3 when the VST algorithm was refactored. I rewrote some of the processor-specific macros and screwed up the SSE3 version. The bug is rare and does not affect the SSE4.1 or AVX binaries.

This slipped through internal testing because it is rare and the SSE3 binaries are not tested as much as the SSE4.1 and AVX binaries.

December 29, 2013

v0.6.3.9415

Windows + Linux

The Digit Viewer has been (almost) entirely rewritten. It has been open-sourced on my GitHub.

New Constants:

Lemniscate Constant
Euler-Mascheroni Constant: Not really new to y-cruncher, but it's back after disappearing for two versions.

New Features:

The Digit Viewer can now count the # of occurrences of each digit.
The same digit counts are now logged in the validation files. This is useful for making sure that the process of writing the digits to disk is actually correct.
Errors are now more descriptive. This should aid in identifying whether issues are software or hardware related.
The validation files now contain a hash for the decimal digits. This can be used to verify that digits are correctly written to disk.
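
As a rough illustration of the digit counts mentioned above, the sketch below tallies how often each decimal digit appears in a string of digits. The function name is hypothetical, and the layout of y-cruncher's digit and validation files is not assumed here. The input is just a Python string, and anything that is not 0-9 (such as the decimal point) is skipped.

```python
from collections import Counter


def count_digits(digits: str) -> dict:
    """Count occurrences of each decimal digit, ignoring other characters."""
    counts = Counter(ch for ch in digits if ch in "0123456789")
    # Always report all ten digits, even those that never occur.
    return {d: counts.get(d, 0) for d in "0123456789"}
```

Comparing such a table against the counts recorded at computation time is a cheap way to notice a file that was truncated or damaged while being written.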

Fixes:

The Linux versions now correctly use O_DIRECT for raw I/Os.
AMD Bulldozer line processors will now choose "x64 SSE3 ~ Kasumi" instead of "x64 AVX ~ Hina". This is because 256-bit AVX performance on Bulldozer and Piledriver processors is worse than SSE3.

Tweaks:

The "x64 SSE4.1 ~ Ushio" and "x64 AVX ~ Hina" binaries have been retuned. They are slightly faster for large computations (> 1 billion digits). But performance for small computations has been decreased slightly.
The VST algorithm has been refactored in preparation for future instruction sets. So there will be some performance differences.

Changes:

The VST algorithm warning for AVX-capable processors has been removed. Prime95 appears to be more stressful now that it has support for AVX.
Checkpoints are now overlapping. Old checkpoints are not destroyed until the new checkpoint has been made. Previously, there were small portions of time when there would be no checkpoint alive. So if an error occurred at just the right time, you'd be screwed.
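
The overlapping-checkpoint idea can be sketched with the common write-then-rename pattern: the new checkpoint is fully written and flushed under a temporary name, and only then takes the place of the old one. This is an illustration of the principle only. The function and file names are hypothetical and say nothing about how y-cruncher actually stores its checkpoints.

```python
import os
import tempfile


def write_checkpoint(directory: str, data: bytes, name: str = "checkpoint.bin") -> str:
    """Write a new checkpoint without ever leaving the directory without one."""
    final_path = os.path.join(directory, name)
    # The new checkpoint is built next to the old one under a temporary name.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk first
        # Only now is the old checkpoint replaced, in a single step.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

If the process dies before `os.replace`, the previous checkpoint is still intact. If it dies after, the new one is already complete.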

June 16, 2013
June 30, 2013

v0.6.2.9316
v0.6.2.9322

Windows + Linux

This release of the v0.6.x series adds the swap modes and checkpoint restart.

New Features:

Huvent's Formula for Catalan's Constant.
Swap Modes and Checkpoint restart have been added for all constants that v0.6.2 supports.
Resuming from a checkpoint will automatically remove all non-checkpoint files leftover from an interrupted computation.

Note that the checkpoint-restart in this version (0.6.2) has more granularity than in earlier versions. Checkpoints can now be made deep into the Binary Splitting recursions of all the series-based constants. Therefore the average time between checkpoints is much smaller and less work is lost upon a failure.
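
For readers unfamiliar with Binary Splitting, here is a minimal sketch of the recursion for the series e = 1 + 1/1! + 1/2! + ... It is a textbook formulation rather than y-cruncher's implementation, and the function names are hypothetical. Each call returns a pair (P, Q) covering a sub-range of terms. Presumably these intermediate pairs are the kind of state that a checkpoint deep in the recursion has to preserve, but the sketch does not do any checkpointing itself.

```python
from fractions import Fraction


def split(a: int, b: int):
    """Return (P, Q) with P/Q == sum of a!/k! for k = a+1 .. b (requires a < b)."""
    if b - a == 1:
        return 1, b  # single term: a!/b! = 1/b
    m = (a + b) // 2
    p1, q1 = split(a, m)  # left half of the range
    p2, q2 = split(m, b)  # right half of the range
    # Merge: P1/Q1 + (1/Q1) * (P2/Q2) = (P1*Q2 + P2) / (Q1*Q2)
    return p1 * q2 + p2, q1 * q2


def e_approx(terms: int) -> Fraction:
    """Approximate e using the first `terms` terms after the leading 1."""
    p, q = split(0, terms)
    return 1 + Fraction(p, q)
```

For example, `e_approx(3)` returns `Fraction(8, 3)`, which is 1 + 1 + 1/2 + 1/6.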

Changes:

All versions are now compiled using Visual Studio 2012. There may be slight changes in performance for all versions except the "x64 AVX ~ Hina" binary.
(Internal) - v0.6.2 now has a C++11 dependency. All prior versions of y-cruncher were compilable with only C99.

Regressions:

Swap modes require more disk space than in v0.5.x. This is because the checkpoint-restart requires that new checkpoints are written before (or shortly after) the previous checkpoint is destroyed. The result is that most operations must be done out-of-place thus increasing memory usage.
In v0.5.x, checkpoints were only done in places where disk usage was low. So the pressure of needing to preserve extra data was not felt. But in v0.6.2, checkpoints are everywhere - including places with high disk usage.

Fixes:

Fixed a bug where, in some scenarios, the program will halt with "Memory Allocation Failure".
Fixed some bugs that would cause small computations to error.
Fixed tons of other minor corner-case bugs.

Postponed features:

XOP and AVX2 support has been put off until I acquire the hardware.
Euler-Mascheroni Constant hasn't been started yet. (Although I should probably do this since that's what the "y" in "y-cruncher" stands for. Doesn't do much justice for a program to be missing its mascot. :P)
Lemniscate's Constant is "extremely low priority".

One thing worth noting is that the error-correction in v0.6.x is less aggressive than in v0.5.x. Version 0.6.2 relies more heavily on the new checkpoint-restart system to recover from errors. So instead of attempting to correct errors on the fly, v0.6.2 will usually just terminate the computation - thereby forcing the user to resume from the last checkpoint.

Note that the error-detection is just as aggressive as before. It's only the error-correction that has been relaxed.

February 17, 2013

0.6.1.9282

Windows Only

After such a long rewrite, this is the first of the v0.6.x series. Not all features are implemented yet. Most notably, the majority of the constants are still missing swap mode. But it's good enough for ram-only benchmarks.

This release is mostly to put an end to the nearly 2-year break of releases.

New Features:

More Detailed Output: The program will display much more detailed output during a computation. This feature used to be developer and private-version only. But it is now enabled in all public releases.
Component Stress-Tester: Runs individual algorithms in y-cruncher to stress different parts of the system.
I/O Benchmark: Benchmarks and evaluates a swap configuration for large computations.
Multi-layer Raid-File:
- Hybrid RAID 0+3 swap-file management to allow for redundancy while preserving the multi-HD functionality.
Compilation Options: An option that displays how each binary has been compiled. Originally a developer-only feature for development purposes, it has been enabled in the public release to satisfy those who are curious.
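
The redundancy side of the hybrid RAID 0+3 item above can be pictured with plain XOR parity: data is striped across several drives, one extra block holds the XOR of the others, and any single missing block can be rebuilt from the rest. The sketch below shows only that principle. The function names are hypothetical, and the actual on-disk layout used by y-cruncher is not described in this history.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the parity block for one stripe of data blocks."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

With three data blocks and one parity block, losing any one of the three still leaves enough information to reconstruct it.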

Removed Features:

Batch Benchmark Pi
The Stress-Tester from v0.5.5 and earlier.
Support for x86 without SSE3.
Basic Swap Mode

Changes:

Privilege elevation is now required to run y-cruncher. This should put an end to all those file allocation problems (which have gotten worse in Windows 8 due to the different UAC settings).
Windows Vista or higher is now required to run y-cruncher.
"Advanced Swap Mode" renamed to simply "Swap Mode".
The division step at the end of each series computation has been separated and given its own timer.
The "Frequency Sanity Check" has been disabled.
The validation files now include detailed event logs of the computation.

Limits:

All constants have a hard-limit of 90 trillion digits.

Optimizations:

Faster Division
Faster Square Roots
Faster Base Conversion
Faster Multiplication for large products > 50 billion digits
Numerous other minor optimizations

Fixes:

Detection for AVX on Windows 8 has been fixed.

Missing Features: The following features are not complete and will not be in this release. All of these exist in v0.5.5, so use that if you wish to use these features.

Checkpoint Restart
Swap Modes for: e, Pi, Log(n), Zeta(3), Catalan's Constant
Euler-Mascheroni Constant

2013 ???

0.6.x

Version 0.6.x will be the first major rewrite of y-cruncher. The following changes are planned. Everything is subject to change.

Due to the large feature set, these will be rolled out incrementally over multiple versions of v0.6.x.

New Features:

(v0.6.1) Component Stress-Tester: Runs individual algorithms in y-cruncher to stress different parts of the system.
(v0.6.1) I/O Benchmark: Benchmarks and evaluates a swap configuration for large computations.
(v0.6.1) Multi-layer Raid-File:
- Hybrid RAID 0+3 swap-file management to allow for redundancy while preserving the multi-HD functionality.
- Failed drives can be replaced on the fly without rolling back to a checkpoint.
(v0.6.3) Lemniscate Constant: Arc-length of a lemniscate = 5.24411510858423962...
(v0.6.3) Swap Mode: Added for the Euler-Mascheroni Constant.

Removed Features:

Batch Benchmark Pi
The current Stress-Tester
Support for x86 without SSE3.
Basic Swap Mode
Basic Swap Mode optimizations/algorithms*

Changes:

(v0.6.1) "Advanced Swap Mode" renamed to simply "Swap Mode".
(v0.6.2) Ramanujan's formula for Catalan's Constant will be replaced with Huvent's BBP. (and will support Swap Mode)
(v0.6.1) Completely overhauled Benchmark Validation system.

Limits:

(v0.6.1) All constants will have a hard-limit of 90 trillion digits.

Optimizations:

(v0.6.4) x64 XOP: Specially tuned for the AMD Bulldozer processor line.
(v0.6.1) Faster Base Conversion.
(v0.6.1) Faster Multiplication for large products > 50 billion digits.
(v0.6.2) More fine-grained Checkpoint-Restart.
(v0.6.3) Raw I/O support for Linux.
Numerous other minor optimizations.

*Due to incompatibilities with new and future planned optimizations, Basic Swap Mode and the optimizations/algorithms that come with it (which are also used by Advanced Swap Mode) will be omitted in v0.6.x.

This, along with the rewrite, means that v0.6.x will be the first version of y-cruncher that will not be strictly faster than an earlier version. This means that some computations, under certain situations, will actually be slower in v0.6.x than with v0.5.5.

April 6, 2011

0.5.5 Build 9180
(fix 2)

Windows Only

Fixes:

The x64 AVX ~ Hina binary is now compatible with non-Intel processors. As a side-effect, the x64 AVX ~ Hina binary is about 1% faster on Intel processors as well.
The contact email address has been changed to [email protected].

February 20, 2011
February 20, 2011

0.5.5 Build 9179
(fix 1)

Windows + Linux

Fixes:

Fixed a major bug in the ArcCoth code that may cause incorrect computation of all dependent constants:
- Log(2)
- Log(10)
- Euler-Mascheroni Constant

This bug has probably been present in y-cruncher since v0.4.1.

February 1, 2011
February 3, 2011

0.5.5 Alpha
Build 9178

Windows + Linux

New Features:

Support for the new Advanced Vector Extensions (AVX) instruction set.

Changes:

All Windows binaries with SSE/AVX are now compiled using the Intel Compiler 11.1.
All Linux binaries are now compiled using GCC 4.4.5. Furthermore, all binaries are compiled as C code, not C++.
Minor changes in speed due to rewritten code.

Optimizations:

x64 AVX ~ Hina:
- Specially tuned for the Intel Sandy Bridge Core i7 processor line.
- ~10% faster than x64 SSE4.1 on Sandy Bridge Core i7.
- Requires Windows 7 Service Pack 1 or later.
The final output of digits at the end of each computation is now faster.
Note that this has no effect on benchmarks since outputting digits to disk does not count towards computation time.
The built-in Digit Viewer is now faster.

Fixes:

The Linux binaries are now statically linked.

August 28, 2010

0.5.4 Build 9157
(fix 1)

Linux Only

Optimizations (Linux):

Retuned I/O. This may or may not be faster than before. Note that raw I/Os are still not used because the current implementation is slower than using straight-forward buffered I/Os. (This is contrary to Windows where raw I/Os are faster than buffered I/Os.)
Slightly improved speed. Added "-ffast-math" to the compile options.

New Features:

Colored console output has been added.
CPU brand detection has been added.
CPU frequency detection has been added.
Memory detection has been added.
- Automatic memory selection has been enabled in Advanced Swap Mode.

August 16, 2010

0.5.4 Build 9150
(fix 1)

Linux Only

New Features:

This is the first Linux release. It is slower and does not support all the features of the Windows version, but it's a start.

August 5, 2010

0.5.4 Build 9148
(fix 1)

Fixes:

Fixed a bug that would cause all computations longer than 24.8 days to trigger a "Sanity Check Error".
Fixed a bug that would cause a "Write Error" when using more than 10 drives in Advanced Swap Mode.
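
The changelog does not say why the limit was 24.8 days, but that figure is what a signed 32-bit counter of milliseconds would give, so that is the assumption behind the arithmetic below. The function name is hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def days_until_wrap(bits: int, signed: bool = True) -> float:
    """Days a millisecond counter of the given width can run before it wraps."""
    max_count = 2 ** (bits - 1) if signed else 2 ** bits
    return max_count / MS_PER_DAY
```

`days_until_wrap(32)` comes out a little over 24.85 days. An unsigned 32-bit counter would last twice as long.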

August 2, 2010

0.5.4 Alpha
Build 9146

New Features:

Checkpointing: Advanced Swap Mode computations can now be interrupted and restarted at certain checkpoints. This allows large computations to survive events such as power outages and unrecoverable computational errors. It also allows computations to be paused and restarted.

Improvements:

Better Error-Correction: The program is now better able to recover from computational errors. Some computational errors that were uncorrectable in previous versions are now correctable in v0.5.4.

May 13, 2010

0.5.3 Build 9134b
(fix 2)

Fixes:

Fixed a memory leak at the end of each computation. This affects Batch Mode the most because it runs many computations in succession.

April 26, 2010

0.5.3 Build 9133b
(fix 1)

Fixes:

Fixed a bug in the Compute + Verify option for Euler's Constant.

April 15, 2010

0.5.3 Alpha
Build 9132

Changes:

The Stress Test feature will now run in below normal priority to increase system responsiveness.
When an error is detected in the stress test feature, both threads will stop after completing (or failing) their current tests.
These are thanks to a number of requests that I have received from some people.
When switching to Advanced Swap Mode, the program will now choose a default memory setting based on the amount of total and available physical memory that is in the system.

The ability to set the memory usage in Advanced Swap Mode was less than obvious in v0.5.2. This resulted in some users using the default lowest memory setting when it could have been a lot faster to use more memory.
In the Benchmark feature, benchmark sizes that require more memory than there is available are faded out. (Though they can still be run.)
The "Validation.txt" files that the program outputs can now be customized with your name/screenname (i.e. a way to identify that the benchmark was done by you).

Fixes:

Though technically not a fix, this version adds a check that detects a bug in Windows where thread creation will sometimes return a normal return value when in fact the thread fails to be created due to insufficient memory.

Previously, this would result in silent errors that would cause a computation to give incorrect digits or trigger other redundancy checks later in the program.
The sensitivity of the cheat-detection has been slightly decreased as it had been giving a lot of false positives on certain motherboards with less precise hardware timers.
Fixed an integer-overflow bug in the 32-bit binaries that would occur when writing decimal digits at the end of a computation that is larger than ~41 billion digits.
Fixed a possible stack-corruption bug for computations larger than 500 billion digits.
Fixed some minor bugs in the interface.
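
The ~41 billion digit threshold in the integer-overflow fix above, like the 20 and 41 billion digit limits quoted elsewhere in this history for signed and unsigned 32-bit indexing, is consistent with a 32-bit index that counts 32-bit words. The word size is an assumption made for this sketch. It is not stated in the changelog.

```python
import math


def max_decimal_digits(index_bits: int, word_bits: int = 32) -> int:
    """Decimal digits addressable when an index of `index_bits` bits counts words.

    word_bits=32 is an assumption chosen because it reproduces the quoted limits.
    """
    words = 2 ** index_bits
    # Each bit carries log10(2) decimal digits of information.
    return math.floor(words * word_bits * math.log10(2))
```

Under that assumption, `max_decimal_digits(32)` is roughly 41.4 billion (unsigned index) and `max_decimal_digits(31)` roughly 20.7 billion (signed index).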

Optimizations:

Algorithmic change in the final Base Conversion for all constants. The new algorithm is a partial implementation of the Scaled Remainder Tree method that was used in the current world record for Pi.
- This switch provides a near 2x speed up for the conversion - or about 10% for Pi computations.
- The rest of the algorithm will be put off to a later version and is expected to give another 30 - 40% speedup for the conversion.
As a side-effect of the new conversion, the memory requirements for square roots, Golden Ratio, e, and Pi have decreased slightly.

New Redundancy Checks:

The speedups brought on by the new conversion algorithm open up an opportunity to add some new redundancy checks to increase the reliability of the program without decreasing performance.
A verification has been added to the Base Conversion at the end of each computation.
Though somewhat expensive, this verification is done after writing the decimal digits to disk and does not count towards the "Computation Time" parameter. Therefore it does not really count as a performance penalty.

This verification is needed because of a change in algorithm for the conversion. (see "Optimizations")

Unlike the old algorithm, this new algorithm is not sufficiently self-verifying. Therefore, a verification is needed to catch any computation errors that fail to propagate to the last 100 digits. (since only the last 51 - 100 digits are checked to see if a computation is correct)

This verification also guarantees that the entire base conversion has been done correctly with a certainty of 2⁶¹. (An error has a 1 in 2⁶¹ chance of not getting caught.)
A verification has also been added to the "Final Multiply" for Advanced Swap Mode Pi computations using the Chudnovsky algorithm. This ensures that any computational error that fails to propagate to the last few hexadecimal digits will be caught with a certainty of 2⁶¹. This comes with a slight performance hit.
A number of new and extremely aggressive redundancy checks have been added to:
1. The "series" for all applicable constants except for Euler's Constant.
  (Redundancy checks for Euler's Constant will be included in a later version.)
  
  In the future, the program will also attempt to correct for errors as well.
2. Newton's Method for Division and Square Roots.
3. Within the new Base Conversion algorithm. (This will actually attempt to correct for errors too.)
- Note that the first of these does come with a noticeable (but small) performance hit.
- Also note that these redundancy checks (although aggressive), will still in no way guarantee that a computation that finishes will finish with the correct results. Verification of the digits will always require a separate and independent algorithm. (or from known pre-computed results)
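
The notes above give only the "1 in 2⁶¹" figure for the base conversion check, not the method. One way to build a check of about that strength is to compare the binary value and its decimal digits modulo a 61-bit prime: if the conversion is correct the two residues must agree, and an error that behaves like a random change slips through only about once in 2⁶¹ tries. The sketch below uses the Mersenne prime 2⁶¹ - 1 as that modulus. This choice, and the function names, are assumptions made for illustration and not a description of what y-cruncher does.

```python
MERSENNE_61 = (1 << 61) - 1  # 2^61 - 1, a prime; assumed modulus for this sketch


def residue_of_decimal(digits: str, modulus: int = MERSENNE_61) -> int:
    """Residue of a decimal digit string, computed one digit at a time."""
    r = 0
    for ch in digits:
        r = (r * 10 + int(ch)) % modulus
    return r


def conversion_matches(binary_value: int, decimal_digits: str,
                       modulus: int = MERSENNE_61) -> bool:
    """Check that the decimal digits agree with the binary value modulo a prime."""
    return binary_value % modulus == residue_of_decimal(decimal_digits, modulus)
```

The check costs one pass over each representation, which is cheap next to the conversion itself. It detects errors but cannot say where they are.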

March 10, 2010

0.5.2 Alpha 3
Build 9082

Fixes:

Removed two hidden line-feed characters that were present in the validation files.
These were unintentional and were causing validation problems because they are non-standard and were being messed up by various text editors and viewers.
Fixed an issue that would prevent the program from being able to perform arithmetic above ~20 trillion digits. As of v0.5.2, only Square Roots and Golden Ratio are unlocked beyond 10 trillion digits. Neither of them use full size arithmetic so they will not actually fail until ~28 trillion digits.

March 4, 2010

0.5.2 Alpha 3
Build 9074

Fixes:

Fixed an issue where the program will halt with an assertion error when it tries to print a line that is longer than 78 characters.
Only computations larger than 10 trillion digits will be large enough to trigger this.

March 3, 2010

0.5.2 Alpha 3
Build 9072

Reliability Update:

y-cruncher now uses raw, non-buffered I/Os. This serves to bypass a number of MAJOR memory issues arising from sub-optimal OS buffering.
The primary purpose of this is to fix one MAJOR issue when handling extremely large files.
When creating a large file for non-sequential writes, Windows will attempt to cache a "small" percentage of the file. What exactly is it caching? I have no idea. My guess is that it's trying to cache the portion of the MFT that maps the file.

The problem arises when that "small" percentage is not that "small" anymore when the swap files are terabytes large...

In one of my test runs, a 2.7 trillion digit multiplication failed when the program attempted to do non-sequential writes to four 1 TB swap files (total 4 TB large). The result was that the system cache exploded which immediately triggered Windows Error Code 1450 because of insufficient virtual memory. Because of the work-around that was added in 0.5.2.9040, the program was able to continue after increasing the virtual memory size. However, it continued to thrash virtual memory for several hours before the program was terminated manually. The thrashing simply showed no signs of stopping.

After diagnosing the cause of the system cache spike (which was more than 10 GB large), it was determined that it was due to the OS's stupid caching schemes. (of course it was never really designed for this kind of use...)

The only true work-around to the problem was to completely avoid OS buffering by using raw I/Os.

February 26, 2010

0.5.2 Alpha 2
Build 9040

Reliability Update:

Added a work-around for an issue where page-thrashing can cause a Windows Error Code: 1450 (ERROR_NO_SYSTEM_RESOURCES).
This is actually not a bug in the program. It is an issue in Windows. In Advanced Swap Mode, there may be long periods of time where y-cruncher does not use all of its allocated memory. As a result, Windows will page out some (or all) of the unused portion. However, when y-cruncher finally does need to use it, Windows will thrash the pagefile like crazy. The resulting stall can sometimes be enough for the OS to fail an I/O with an error code 1450.
This "work-around" isn't really a work-around at all. Instead of terminating the program when it encounters an I/O error, it simply pauses and retries until it either completes successfully, or the user decides to kill the program because something else is clearly wrong. This may also give other types of I/O failures another chance in case of a random failure of some sort.
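
The pause-and-retry behavior described above can be sketched as a loop that keeps calling the I/O operation until it succeeds. The names and the use of `OSError` are illustrative assumptions. The original runs on the Windows API and its actual retry logic is not documented here.

```python
import time


def retry_io(operation, delay_seconds: float = 1.0, on_error=None):
    """Call `operation` until it succeeds, pausing after every I/O failure.

    There is deliberately no retry limit: the loop ends only when the
    operation succeeds or the user kills the program.
    """
    while True:
        try:
            return operation()
        except OSError as err:
            if on_error is not None:
                on_error(err)  # e.g. print the error so the user can decide
            time.sleep(delay_seconds)
```

Errors other than `OSError`, and a Ctrl+C from the user, still propagate and stop the loop.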

February 25, 2010

0.5.2 Alpha 2
Build 9037

Fixes:

Fixed an issue that may prevent extremely large computations from working properly when a low memory cap is selected.
This is an issue in the 5-step convolution algorithm for squaring. This does not affect 5-step convolution for multiplication. Not all cases of squaring via 5-step convolution are affected. Only when the memory selection is very low does it occur. (3.3 trillion digits using less than 8GB of ram will trigger this.)
When this issue is triggered, one of two things may happen:
1. The program will be tricked into using 3-step convolution - which may result in extreme performance degradation.
2. The program will terminate with an error stating that there is insufficient memory.

February 23, 2010

0.5.2 Alpha
Build 9025

Fixes:

MAJOR fix to Advanced Swap Mode. This version fixes an issue that was causing a major performance degradation in Advanced Swap Mode.
This was caused by slight differences between the public releases and the private betas.

Generally speaking, only the internal builds of the program are tested. Those internal builds have extra code in them that displays detailed debugging information. None of this code is compiled in the public releases.

It just unfortunately turns out that there was some "required" code that was accidentally put with the debugging code. So it was not compiled in the public versions for v0.5.2.9021.

This build fixes those errors.

February 23, 2010

0.5.2 Alpha
Build 9021

New Features:

Advanced Swap Mode:
- Yeah, it's about time... This is probably the only thing that matters in this version. :P
- Allows large computations to be done using very little ram.
- Full support for multiple hard drives: Although this may seem redundant with Raid 0, it allows for unlimited drives. This serves to bypass limitations imposed by Raid 0. (which is usually limited to 4 - 6 drives)
  
  Total bandwidth scales linearly with the # of drives, but bottlenecked by the slowest drive. This is potentially better than Raid 0 in some cases. You will need to play with the settings to achieve the optimal combination of Raid 0 and the multi-hard drive setting in y-cruncher.
  (For example: 3 x 4-way Raid 0 vs. 4 x 3-way raid 0.)
- Note that this is a very primitive version of Advanced Swap Mode. It has also yet to be burn-in tested so it's potentially very buggy. And lastly, it isn't supported for all constants and algorithms yet.
  
  The x86 versions now use 64-bit indexing for Advanced Swap Mode. So they should be clear for computations greater than 20 or 41 billion digits (which are the respective limits for signed and unsigned 32-bit indexing).
  However, I have yet to test them above those sizes so they may still fail if there are any remaining 32-bit indexes that should have been converted to 64-bit.
  
  The x64 versions have always used 64-bit indexing for everything, so they are clear for all sizes up to the theoretical limit of the program.
  
  Future improvements should include:
  - Reduced total disk memory usage.
  - Reduced # of disk I/Os.
  - Checkpointing and crash-recovery.
  - Support for all constants and algorithms.
  - A 3rd algorithm for Catalan's Constant. The current secondary algorithm is extremely I/O bound due to its use of the AGM (Arithmetic Geometric Mean).
New Validation Scheme:
- The validation now provides much more detail than in the previous versions.
- All computations are now validated.
  - All constants. Not just Pi.
  - All algorithms.
  - All computation modes.
- Even failed computations and benchmarks are validated. They will simply be marked as "failed" in the validation certificate.
- The validator is now easier to use. Just upload the file. No more manually entering fields.
- Only the benchmark feature for Pi will be able to auto-verify the computed digits. But the last 50-100 digits that are computed will be included in the validation so that they can be verified using external sources.

Fixes:

Memory estimation is more accurate. (Previously, it would underestimate actual memory usage by as much as 50MB.)
The Stress Test feature will no longer over-shoot the target memory usage by about 50MB. (Same bug as above.)
When entering a write path or a swapfile path, the program will now actually check to see if the path is valid and writable.
Fixed the timers in the "Compute + Verify" option for e.

Limits:

Advanced Swap Mode without raising the limits of the program is kinda useless:
- The limits for e, and Pi have been raised to 10 trillion digits.
- The limits for Log(2), Log(10), and Zeta(3) have been raised to 1 trillion digits.
- The limits for Catalan's Constant, and Euler's Constant have been raised to 250 billion digits.
Note that these limits are well above the current world records for each respective constant. (At the time of this writing.)

So feel free to attempt a world record if you have the resources. But bear in mind that the program has NOT been tested at these sizes. So there is no guarantee that it will function correctly.

I can no longer afford to tie up my machines for extended periods of time, so I can't do any more long running computations on my own machines.
The x86 versions are all limited to 80 billion digits. This is because computations above that size become extremely inefficient without using more than 2GB of ram (which is the limit for x86).

Internally, x86 versions are capable of performing much larger computations than a mere 80 billion digits. But it would be completely impractical to do so without the use of SSDs (Solid State Disks) - which is not recommended anyway because of write-wear.
As with the previous few releases, Square Roots and Golden Ratio have no limit.
They are capped at 90 trillion digits to give a couple orders of safety margin before reaching the precision limit of 64-bit floating-point - which is the true theoretical limit of the program.

Optimizations:

All x64 binaries are now a bit faster. (16 register tuning)
- Prior to this version, the vast majority of performance-critical code was written on x86 and tuned for 8 GP and SSE registers - which is sub-optimal for x64. The x64 binaries for this version are better tuned for 16 registers (GP and SSE).
- 5 - 12% faster on AMD K10.
- 2 - 6% faster on Core 2.
- 2 - 5% faster on Core i7.
- The speed of the x86 binaries is unchanged since v0.4.4.

Other:

This version is not speed consistent with v0.4.3 - v0.4.4 (which have become semi-standardized). Furthermore, this version will likely be the first in a series of successive optimizations. Therefore the use of v0.5.x for competitive benchmarking should be held back until the speed of the program stabilizes.
Advanced Swap Mode opens the possibility for hard drive benchmarking. But this will be heavily biased towards machines with a lot of ram and a lot of hard drives (or SSDs) running in parallel.

(Which could turn into a competition of who has the "most" hardware - rather than who has the "best" or "best tweaked" hardware...)

January 6, 2010

0.4.4 Build 7762b
(fix 2)

Fixes:

Fixed the benchmark validator.

December 2, 2009

0.4.4 Build 7760
(fix 1)

New Features:

The last 50 - 100 digits are printed out at the end of a computation.
The Compute + Verify modes for all constants that support it will now actually compare the digits from the computation and verification runs to see if they do indeed match. This auto-compare already existed in v0.1.0 - v0.2.1, but was taken out completely from v0.3.1 onwards. This release re-enables this feature. But for the sake of efficiency and ease of implementation, it only compares the last few digits of the two runs to determine if the computations match (whereas v0.1.0 - v0.2.1 compared ALL the digits).

Fixes:

Fixed a bug in the CPU consumption and utilization %'s.
Fixed some minor bugs in the x86 binaries.

Changes:

The CPU consumption and utilization measurements no longer include the time needed to write digits to disk. They now only measure the actual computation time. Writing digits to a slow disk had the effect of drastically lowering utilization and efficiency %'s leading some to believe that the program is a lot less efficient than it really is. Enabling vs. disabling hexadecimal output also had a huge effect on the measurements.
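
To see why including disk output drags the percentages down, here is a small worked example. The definitions used are assumptions, since the changelog does not spell them out: utilization is taken as CPU time divided by wall time (100% equals one fully busy core), and multi-core efficiency as utilization divided by the number of cores.

```python
def cpu_stats(cpu_seconds: float, wall_seconds: float, cores: int):
    """Return (utilization %, multi-core efficiency %) under the assumed definitions."""
    utilization = 100.0 * cpu_seconds / wall_seconds
    efficiency = utilization / cores
    return utilization, efficiency
```

On 8 cores, a computation that burns 720 CPU-seconds in 100 seconds of wall time scores 720% utilization and 90% efficiency. If another 50 seconds of mostly idle disk writing is counted as well, the same run drops to 480% and 60%.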

November 18, 2009

0.4.4 Build 7748

New Features:

New specially tuned binary for AMD K10 Processors.
Start and end dates have been added to computations. (Useful for those extra long computations.)
CPU utilization and multi-core efficiency statistics have been added.
Added an "Advanced Options" section. The benchmark validator has been moved there.
Users who are running an x86 OS on an x64 SSE3 capable system will be informed.
Added detection support for AVX and FMA instruction sets.

Optimizations:

x64 SSE3 ~ Kasumi: (Credit to Raymond Chan.)
- Specially tuned for Phenom II X4.
- 0.5 - 2 % faster than v0.4.3 (x64 SSE3) on Phenom II X4.
Slightly faster Log(2), Log(10), and Euler's Constant.

September 29, 2009

0.4.3 Build 7681

New Features:

Colored Console Output:
- Slightly less dull-looking than previous versions. :)
Automatic Version Detection:
- Launch Executable. It will automatically choose the best version of the program to run.
Validated Batch Benchmarks:
- Standard and SuperPi-sized batch benchmarks now provide validation.
Stronger Anti-Tampering Protection:
- Binaries that have been tampered with will not run.
- Helps guard against validation cracking via modding the executables.

Fixes:

Fixed a bug in the Digit Compare feature.

Optimizations:

All Binaries:
- All SSE binaries are now compiled using the Intel Compiler.
- Numerous internal optimizations.
- Status refreshing has been capped to once/second to reduce printing overhead for small benchmarks.
x64 SSE4.1 ~ Ushio:
- Specially tuned for Core i7.
- 5 - 18% faster than v0.4.2 (x64 SSE3) on Core i7.
x64 SSE4.1 ~ Nagisa:
- Specially tuned for Harpertown.
- 0 - 12% faster than v0.4.2 (x64 SSE3) on 2x Harpertown.
x64 SSE3:
- Retuned for a smaller cache. (Previously tuned for 3MB cache/thread.)
- Speed up vs. v0.4.2 (x64 SSE3) varies by processor. (Typically around 5 - 10%)
x86 SSE3:
- Retuned for a smaller cache. (Previously tuned for 2MB cache/thread.)
- Much improved multi-core efficiency for small computations.
- 15 - 50% faster depending on computation size.
- Single-threaded timings are now competitive with PiFast 4.3.
x86:
- Retuned for a smaller cache. (Previously tuned for 2MB cache/thread.)
- Much improved multi-core efficiency for small computations.
- 10 - 40% faster depending on computation size.

Overall:

This is the first release dedicated primarily to optimizations. There are few functional changes.
Note that the binaries have gotten a lot larger since v0.4.2. This is because the Intel Compiler does more aggressive optimizations than the Visual Studio Compiler.

August 10, 2009

0.4.2 Build 7438

New Features:

Batch mode option for running automated benchmarks.
Stress Testing option for stability checking and burn-in testing.

Fixes:

Corrected the name for the secondary formula for Catalan's Constant.
Corrected some spelling errors.

July 24, 2009

0.4.1 Build 7412 (fix 1)

Fixes:

Fixed a major bug in the Basic Swap mode for the x64 binaries.

July 22, 2009

0.4.1 Build 7409

Fixes:

Fixed an issue where a "Sanity Check Error" would sometimes occur for extremely fast benchmarks that take less than a few seconds.

July 20, 2009

0.4.1 Build 7408

New Constants:

Square Root of any small integer
Golden Ratio
e

New Features:

Added "SuperPi" sized benchmarks:
- 1M, 2M, 4M, etc... up to 128G.
Existing benchmarks have been extended to 100b.
- To satisfy those who have access to server-racks and super-computers... Don't even try these on a desktop... :)
Digit Compare is back and with full support for compressed digits.
Compute and Verify is back for the constants that benefit from reusing steps.
- e
- Log(2) and Log(10)
- Euler's Constant

Fixes:

Fixed a bug in the secondary formula for Euler's Constant where it would sometimes terminate with an "Allocation Failure" even when there is plenty of memory.
Fixed a bunch of bugs in the x86 binaries...

Optimizations:

Minor speed-ups in a few random places.

Limits:

Thread limit has been increased from 64 to 256 threads.
200 billion digits for Square Roots, Golden Ratio, e, and Pi.

Other:

A lot of code has been rewritten and retuned in preparation for some future features. So there may be some minor speed differences for all computations.
Dropped support for x64 without SSE3.

May 14, 2009

0.3.2 Alpha
Build 6953 (fix 1)

Fixes:

Fixed a major bug in the digit viewer where it may incorrectly view compressed decimal digits in .ycd files larger than ~2 GB.

April 30, 2009

0.3.2 Alpha
Build 6945

New Features:

Added a single-threaded mode for Benchmarks.
Minor improvements to benchmark validation.
Benchmark validation is now slightly more resistant to cheating.

Optimizations:

Computations now require less memory. (~20% for Pi, less so for other constants)
- The 2.5b, 5b, and 10b benchmarks will just barely fit into 12GB, 24GB, and 48GB of ram respectively - perfect for triple channel Nehalem systems.
The % complete status now has a bit more resolution.

Fixes:

The error-correction feature has been fixed. In benchmark mode, errors will automatically fail a benchmark even if the error is recoverable.
Some inconsistencies with the reported CPU frequency have been fixed. Note that the incorrect readings on multiplier-jacked CPUs have NOT been fixed yet.

April 17, 2009

0.3.1 Alpha
Build 6897

New Features:

Benchmark Validation and Anti-cheat protection.
- Pre-set sizes for validated benchmarks:
  - x86: 25m, 50m, 100m, and 250m
  - x64: 25m, 50m, 100m, 250m, 500m, 1b, 2.5b, 5b, and 10b
    - The larger benchmarks will require a LOT of ram. New Challenge!
    - How high can you overclock while maintaining a full ram configuration?
    - How high can you overclock a fully-loaded workstation?
- Benchmark computations will be verified against known digits to ensure that they are correct.
- Anti-clock tampering protection.
  - Timings now use hardware clocks - which are more accurate and cannot be tampered with via system clock.
  - Try to tamper with the clocks (there's more than one), and it will fail validation.
- Validation Checksum
  - Checksums are computed from Benchmark Time, CPU frequency, CPU type... (among other things).
  - Protects against output tampering.
  - Protects against system substitution. (transferring the output of a valid benchmark from a faster computer to a slower one)
New Layout for Option Selection
- The program starts with a set of default options - which can be changed manually. This avoids all the option selection from the previous versions.
- Auto-detect # of threads.
- Shows estimated disk usage for swap computations.
- Output path can now be specified.
Compressed Digit Format
- Hexadecimal digits will compress to 50% of text-file size.
- Decimal digits will compress to roughly 42% of text-file size.
- Compressed digits can be read directly by the new digit viewer.
- Compressed digits can be split into smaller files and accessed individually by the digit viewer.
- This feature is already present in the new Digit Viewer. Version 0.3.1 fully integrates it.
Euler's Constant can now be computed to any # of digits. (It was locked to specific sizes in the previous versions.)
Compute and Verify + File Compare have been temporarily disabled as they need to be updated to support the new compressed digit format.
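
The compression ratios quoted above for the Compressed Digit Format are consistent with packing digits into 64-bit words: 16 hexadecimal digits per 8-byte word gives exactly 50%, and 19 decimal digits per 8-byte word gives 8/19, or roughly 42%. The following Python sketch illustrates that arithmetic. It is only an illustration under that assumption and is not the actual .ycd file layout; the names pack_decimal, unpack_decimal, and packed_ratio are hypothetical.

```python
# Illustrative sketch only. Assumption: digits are packed into 64-bit words,
# 19 decimal digits per word (10**19 < 2**64). This is NOT the real .ycd layout.
DIGITS_PER_WORD = 19
WORD_BYTES = 8


def pack_decimal(digits: str) -> bytes:
    """Pack a string of ASCII decimal digits into 64-bit little-endian words."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of decimal digits")
    out = bytearray()
    for i in range(0, len(digits), DIGITS_PER_WORD):
        # Pad the final partial chunk with trailing zeros to a full word.
        chunk = digits[i:i + DIGITS_PER_WORD].ljust(DIGITS_PER_WORD, "0")
        out += int(chunk).to_bytes(WORD_BYTES, "little")
    return bytes(out)


def unpack_decimal(data: bytes, count: int) -> str:
    """Recover the first `count` digits from words written by pack_decimal."""
    words = []
    for i in range(0, len(data), WORD_BYTES):
        value = int.from_bytes(data[i:i + WORD_BYTES], "little")
        # zfill restores leading zeros that int() dropped.
        words.append(str(value).zfill(DIGITS_PER_WORD))
    return "".join(words)[:count]


def packed_ratio(digits_per_word: int) -> float:
    """Packed size as a fraction of a one-byte-per-digit text file."""
    return WORD_BYTES / digits_per_word


if __name__ == "__main__":
    sample = "14159265358979323846264338327950288419716939937510"
    packed = pack_decimal(sample)
    assert unpack_decimal(packed, len(sample)) == sample
    print(f"decimal: {packed_ratio(19):.1%} of text size")
    print(f"hex:     {packed_ratio(16):.1%} of text size")
```

Whole words are used here so that any word can be decoded independently of the others, which is one way a viewer could jump to an arbitrary digit offset without decompressing the whole file.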

Overall:

This release consists of mostly interface changes. No optimizations. No bug fixes.

April 10, 2009

0.2.1 Alpha
Build 6841

New Constants:

Log(10)
Zeta(3) - Apery's Constant
Catalan's Constant

Fixes:

Fixed a pagefile thrashing problem when writing digits at the end of a large computation that used all the ram in a computer.
FFT settings have been pulled back to more conservative levels. This comes at a slight speed penalty, but is necessary to ensure reliability.
Fixed some errors that were caused by the program being a little bit too aggressive with multi-threading. This also comes at a slight speed penalty.
Added an extra redundancy check for base conversions. (see below)*

Optimizations:

Faster "Compute and Verify" for Log(2) via a better pair of Machin Formulas.
Improved multi-core efficiency. Barely noticeable on dual-core but an obvious improvement on 8-core. (Nearly 10% improvement in some cases on 8-core.)
Basic Swap Mode is now a bit faster and requires only half the memory from before.

Size Limits:

x86
- 466 million digits for Euler's Constant
- 840 million digits for all other constants
x64
- 29.8 billion digits for Euler's Constant
- 31 billion digits for all other constants

Overall:

This second release (as well as the next few) consists mainly of new features and bug fixes. There won't be much in the way of optimizations. Therefore, the next few releases won't be much faster (and maybe even a bit slower if certain bug fixes necessitate it). I'll make up for it when I start doing optimizations.

*This extra redundancy check is needed to close a small weakness in the method that y-cruncher uses to verify its base conversions.

In order to understand the following paragraph, you must be familiar with radix conversions on floating point numbers.

For a record size computation to qualify as a new world record, it must be verified.
y-cruncher performs a base conversion on a number by first normalizing it to an integer, and then base converting the integer.
The current method of verifying a base conversion is to do it twice using different cutting parameters and apply a modular hash check on the final (integer) base conversion. However, I have found that the powering stage of the normalization step goes through much of the same arithmetic even with different cutting parameters. This opens up a weakness. Since the base conversion is done twice, any hardware errors will be caught. However, if there is a bug (programmer or compiler error) that affected the normalization, it may result in the same incorrect answer for both conversions because of the "shared" arithmetic (and thus pass final verification).
To close this weakness, I have added a modular hash check to the powering stage of the normalization. All existing records that have been set prior to this change should still be fine because y-cruncher already has redundancy checks built into its multiplication. And of course, the digits agree with previous records.
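
The idea of a modular hash check on the powering stage can be sketched as follows. This is a minimal Python illustration and not y-cruncher's actual code: the modulus HASH_PRIME and the helpers checked_multiply and checked_power are hypothetical. Each large product is reduced modulo a fixed prime and compared against a hash that is carried separately through small-number arithmetic, so a faulty large multiplication is caught even if it would produce the same wrong answer on every run.

```python
# Illustrative sketch only; names and the choice of modulus are assumptions.
HASH_PRIME = (1 << 61) - 1  # the Mersenne prime 2**61 - 1


def exact_multiply(x: int, y: int) -> int:
    """Stand-in for the large (e.g. FFT-based) multiplication being checked."""
    return x * y


def checked_multiply(a, a_hash, b, b_hash, multiply=exact_multiply):
    """Multiply a by b and verify the product against the carried hashes."""
    product = multiply(a, b)
    # The expected hash is derived only from the operand hashes, so it does
    # not depend on the large multiplication being correct.
    expected = (a_hash * b_hash) % HASH_PRIME
    if product % HASH_PRIME != expected:
        raise ArithmeticError("modular hash mismatch in multiplication")
    return product, expected


def checked_power(base: int, exponent: int, multiply=exact_multiply) -> int:
    """Compute base**exponent by binary powering, checking every product."""
    result, result_hash = 1, 1
    square, square_hash = base, base % HASH_PRIME
    while exponent > 0:
        if exponent & 1:
            result, result_hash = checked_multiply(
                result, result_hash, square, square_hash, multiply)
        exponent >>= 1
        if exponent:
            square, square_hash = checked_multiply(
                square, square_hash, square, square_hash, multiply)
    return result


if __name__ == "__main__":
    assert checked_power(10, 1000) == 10 ** 1000

    def buggy_multiply(x, y):
        # Simulates a deterministic programmer or compiler bug.
        return x * y + 1

    try:
        checked_power(10, 1000, multiply=buggy_multiply)
    except ArithmeticError as err:
        print("caught:", err)
```

A check like this can miss an error only when the wrong product happens to agree with the correct one modulo the prime, which for a 61-bit modulus is very unlikely for a random fault.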

January 19, 2009

0.1.0 Alpha
Build 6013

Initial Constants:

Pi
Log(2)
Euler-Mascheroni Constant

Initial Features:

Versions: x86, x86 SSE3, x64, and x64 SSE3
Basic Swap Mode
Multi-Threading
Multi-Hard Drive
Semi-Fault Tolerance

Size Limits:

x86
- 233 million digits for Euler's Constant
- 420 million digits for all other constants
x64
- 7.4 billion digits for Euler's Constant
- 10 billion digits for all other constants