After more than 2 years of waiting, y-cruncher with AVX512 has finally been tested on native hardware. David Carver was kind enough to test drive an internal version of y-cruncher v0.7.1 which has the AVX512-CD binary enabled. Here it is compared to some more conventional machines:
|Processors:||Core i7 5960X||2 x Xeon E5-2696 v4||Xeon Phi 7250||Xeon Phi 7250|
|Processor Speed:||4.0 GHz (OC)||2.2 GHz||1.4 GHz||1.4 GHz|
|Binary:||AVX2||AVX2 + ADX||AVX2 + ADX||AVX512-CD|
The AVX512-CD binary uses AVX512 Foundation and Conflict-Detection instructions. It has been in development since early 2014, but had never been run on native hardware until now. Now it has been confirmed to work well enough to do a Pi benchmark.
Performance-wise, Knights Landing falls short of the highest-end Haswell-E and Broadwell-E systems. Furthermore, the AVX2 -> AVX512 scaling is a lackluster 34%. For now, the reason remains unknown. But it's currently hypothesized to be either memory bandwidth or Amdahl's Law.
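For a rough feel of what the Amdahl's Law hypothesis would imply, here is a back-of-the-envelope sketch. It assumes (this is not something that was measured) that AVX512 ideally doubles the throughput of vectorized code relative to AVX2. It then solves for the fraction of run-time that would have to scale to produce the observed 34%. The function names are made up for illustration.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the run-time is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

def fraction_for_speedup(overall, s):
    """Invert Amdahl's Law: the fraction p that yields the given overall speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s)

# Assumption: doubling the vector width gives an ideal 2x on vectorized code.
IDEAL_VECTOR_GAIN = 2.0
OBSERVED = 1.34  # the 34% AVX2 -> AVX512 scaling quoted above

p = fraction_for_speedup(OBSERVED, IDEAL_VECTOR_GAIN)
print(f"fraction of run-time that would need to scale: {p:.2f}")
```

Under that assumption, only about half of the run-time would be benefiting from the wider vectors. This doesn't distinguish between the two hypotheses: the non-scaling half could just as well be memory-bound code as non-vectorized code.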
It's worth noting that y-cruncher is completely untuned for the Knights Landing architecture. Nearly all optimizations and tuning settings are the same as the desktop chips. So there's likely more performance left to be squeezed out. But due to the cost of Xeon Phi systems along with the general inaccessibility to consumers, it will be a while before y-cruncher has any properly tuned binaries for Knights Landing (if ever).
The AVX512-CD binary (for both Windows and Linux) is available upon request to anyone who sends me a Knights Landing benchmark. But for now, I'm hesitant to formally release it since it hasn't been sufficiently tested. (A Pi benchmark has very poor test-coverage of the entire program.)
In addition to the AVX512-CD binary, y-cruncher also has AVX512-DQ and AVX512-IFMA binaries for Skylake Purley and Cannonlake. But assuming Intel sticks with its policy of massive delays, it will be quite a while before either of them sees the light of day.
This is a semi-unplanned release to address a number of critical issues with the HWBOT integration. (Most notably the reference clock skew issue.)
Other than that, there are few other user-visible features. Most of the changes since v0.6.9 are internal refactorings. Some of these were large (and dangerous) enough that it probably would've been better to wait a few more months before releasing v0.7.1. So if anything breaks, let me know.
While this version wasn't intended to have many new features, all that refactoring did lend itself to some opportunistic stuff such as large pages and Unicode support.
I get asked these two questions a lot:
#1 never happened because I suck at UI programming and I didn't want that mixed in with performance critical code.
#2 never happened because the HWBOT benchmark API wasn't ready.
Well, both are finally happening... More details here: http://forum.hwbot.org/showthread.php?t=155079
Anyone who has been following my GitHub profile for the past year will know that I've been working on a library that exposes the compute-engine of y-cruncher. Well that's finally done and pushed out the door. (It was actually completed in January, but I waited until now following my usual "wait several months for Q/A".)
In any case, the spin-off project consists of two components:
YMP stands for "y-cruncher Multi-Precision Library". For the most part, it's just another bignum library - except that it supports SIMD and parallelized large multiplication.
Number Factory is largely a test app for the YMP library. It implements much of the same functionality as y-cruncher, albeit more cleanly and less efficiently.
The two can be found on my GitHub: https://github.com/Mysticial/NumberFactory
Documentation for the library can be found here: http://www.numberworld.org/ymp/v1.0/
For now, the project is entirely experimental and is available only for 64-bit Windows with Visual Studio 2015. It is far from mature and there are no plans to support Linux in the near future. But at the very least, it will let people code things up that utilize y-cruncher's parallel large multiplication.
Intel seems to be taking its fine time with the AVX512 stuff. So I guess that's not happening any time soon...
While that endless wait continues, there are some scalability improvements that have been sitting on my desk for months:
*The Linux versions of y-cruncher have historically been statically linked as a means to avoid the DLL dependency hell on Linux. Unfortunately, Intel does not provide a static library for Cilk Plus - thereby forcing dynamic linking. After fiddling with this for multiple weekends, I still can't get a build that runs on even my own 3 Linux boxes.
I give up. For now, y-cruncher for Linux will be available in both static and dynamic versions. The dynamic version will target a recent version of Ubuntu and will support Cilk Plus. The static version will run almost anywhere as before, but it lacks Cilk Plus.
This applies to Ubuntu 15.10, but may also apply to other Linux distributions with the same kernel version.
When performing swap mode computations on Ubuntu 15.10, the OS has been observed to do excessive swap file access when y-cruncher is performing heavy I/O. This swapping is so severe that the OS becomes unresponsive and the computation stalls. It has not been observed in Ubuntu 15.04.
So far, the only solution that seems to work reliably is to completely disable the swap partition. Then reduce the amount of memory that y-cruncher should use so that you don't run out of memory. Another possible solution is to set the "swappiness" value to zero. But this is untested.
The next version (v0.6.9) will have a lower default memory setting for swap mode.
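Before starting a long swap mode run, it can help to check what the OS swap configuration actually is. The following is a minimal sketch that reads the standard Linux files `/proc/meminfo` and `/proc/sys/vm/swappiness`. The helper names are made up for illustration and none of this is part of y-cruncher.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (most fields are in kB)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info

def read_text(path):
    """Return the file's contents, or None if it can't be read (e.g. not on Linux)."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None

def swap_report():
    """Summarize the current swap configuration. Missing values come back as None."""
    info = parse_meminfo(read_text("/proc/meminfo") or "")
    raw = (read_text("/proc/sys/vm/swappiness") or "").strip()
    return {
        "swap_total_kb": info.get("SwapTotal"),
        "swap_free_kb": info.get("SwapFree"),
        "swappiness": int(raw) if raw.isdigit() else None,
    }

print(swap_report())
```

A `swap_total_kb` of 0 means that no swap partition or file is active, which is the state that `swapoff -a` leaves the system in. The swappiness value can be changed by root with `sysctl vm.swappiness=0`, though as noted above, whether that actually avoids the problem is untested.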
I'm pleased to announce that after running for more than half a year, Robert Setti has computed Catalan's Constant to 200 billion digits.
This is very impressive because Catalan's Constant is one of the slowest to compute (among popular constants that can be computed in quasi-linear time).
A lot of unexpected personal stuff happened this last month. I'll be starting a new job next week that is potentially much more stressful than ever before.
So I've decided to push out all the remaining bugfixes for v0.6.8. Depending on how things turn out, this may very well be the last version until the Skylake Xeon.
I totally knew this would happen... The moment I rush a release (for Pi Day), something breaks.
It turns out there was a very large (up to 5%) performance regression on Haswell processors that scales inversely with the memory bandwidth of the system. Normally, such large regressions are caught long before they can be released. But since my primary test machine has 4 channels of overclocked DDR4, I never noticed the regression at all. It was only after Shigeru Kondo reported this that I tested it on a different Haswell machine, which did reproduce the regression.
This screw-up involved a cache optimization that was designed into the algorithm from the very beginning rather than being added later as a result of profiling. Being a premature optimization, it backfired for small computations and had no noticeable effect for large ones on my machine. Therefore I disabled it sometime between v0.6.7 and v0.6.8. Not a big deal, I expect some of these premature micro-optimizations to backfire.
Well... It turns out that the premature optimization wasn't really that premature after all. On large computations, it reduces the demand on memory bandwidth. If the processor is bandwidth-starved (such as mainstream Haswell), it translates to an increase in performance. Therefore, this patch re-enables the optimization on all processors except for AMD Bulldozer - which takes a 10% performance hit for some unknown reason.
The sad part is that, back in 2012, I did this optimization because I predicted that it would reduce the demand on memory bandwidth. But it wasn't until 2015 that the hardware and software became fast enough for it to actually make a difference. During those 3 years, I completely forgot about the optimization and left it in the "on" position until a recent refactoring touched it. That prompted me to re-evaluate it and (erroneously) disable it for the Pi Day v0.6.8 release.
Surprise! There was no way I could possibly pass this day up, right?
y-cruncher v0.6.8 is out with some new features:
It turns out that v0.6.6 had yet another serious bug that would cause a large multiplication to fail under the right circumstances. But at least this time, it wasn't related to swap mode. So after fixing that and cherrypicking it into v0.6.7 (along with 3 other things), I think we're good to go.
About that unstable workstation...
While the system still isn't entirely stable in Linux yet, it's in good enough shape to do longer running stuff.
This instability turned out to be a good opportunity to test the program's never-used RAID3 implementation. The hard drive configuration was 2 sets of 8 drives each in RAID3. One computation tolerated 9 hard drive read errors and still managed to finish correctly.
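To illustrate why a RAID3 set can survive a failed read, here is a minimal sketch of byte-level striping with a single XOR parity block. It is not y-cruncher's implementation. The function names are hypothetical, and the split of each 8-drive set into 7 data drives plus 1 parity drive is an assumption made for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def stripe_with_parity(data, data_drives):
    """Split data byte-wise across data_drives blocks and append one parity block."""
    padded = data + bytes(-len(data) % data_drives)  # zero-pad to a full stripe
    blocks = [padded[i::data_drives] for i in range(data_drives)]
    return blocks + [xor_blocks(blocks)]

def recover(blocks, failed):
    """Rebuild the block at index `failed` from all of the other blocks."""
    return xor_blocks([b for i, b in enumerate(blocks) if i != failed])

def reassemble(blocks, length):
    """Interleave the data blocks (all but the last) back into the original bytes."""
    out = bytearray()
    for column in zip(*blocks[:-1]):
        out.extend(column)
    return bytes(out[:length])

message = b"3.14159265358979323846"
blocks = stripe_with_parity(message, 7)  # assumed layout: 7 data + 1 parity
lost = 3
original = blocks[lost]
blocks[lost] = None                      # pretend this drive returned a read error
rebuilt = recover(blocks, lost)
blocks[lost] = rebuilt
print(reassemble(blocks, len(message)) == message)
```

A single parity block can only repair one bad block per stripe, and only when it is known which block is bad. The 9 read errors mentioned above would presumably have been spread across different stripes.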
This entire instability mess has prompted me to update the user guide for Swap Mode with a new section.
Version 0.6.7 has been built and is undergoing final testing. But I have no idea how long that will take. While everything looks good on Windows, testing on Linux is currently blocked while I diagnose an instability on my storage workstation with Linux.
The likely source of the instability is a massive hardware upgrade in December. Since Windows is fine and Linux is unstable, I suspect it's a driver issue. But I have yet to sort it out. The fact that I'm not much of a Linux person isn't really helping the situation.
In any case, part of that hardware upgrade involved adding 8 hard drives to the machine for a total of 16 drives. So v0.6.7 consists of mostly swap mode improvements that I decided to do after playing around with this 16 hard drive toy.
The main feature of v0.6.7 is a swap-mode multiplication tester which has two purposes:
The second point is important for anyone attempting world record computations. As the sole developer of y-cruncher, I only have the resources to test large multiplications up to around 5 trillion digits. Which means that I cannot reach the sizes that are now required to set Pi world records.
In the past, there have been bugs in the multiplication which only manifest at sizes that nobody has ever reached before. The scenario that I want to avoid is for someone else to spend months attempting a world record only to fail because of a bug in my code. In a sense, it's somewhat miraculous that y-cruncher is 4 for 4 in world record attempts for Pi. (i.e. no fatal software bugs)
Version 0.6.6 is turning into one of the worst versions of y-cruncher. Once again, I've found some serious bugs that need to be fixed asap.
Because of the severity of the regressions that have been fixed, I highly recommend everyone who uses swap mode to upgrade to this patch (v0.6.6.9452).
In particular, the error-detection in v0.6.6 prior to this patch is so badly broken that it's as good as useless.
There's an on-going task to refactor the entire program from C -> idiomatic(ish) C++11 for better long-term maintainability. In v0.6.6, this refactor touched some of the fragile swap mode code - which is very difficult to test because it is slow, resource-intensive, and full of corner cases.
This release is a bit premature since it hasn't been tested much yet. But it fixes a lot of bugs including one that causes all large Pi benchmarks to fail on the 6-core Haswell processors. So ready or not, it needs to be released. In addition to bug-fixes, v0.6.6 adds a couple of new features and some minor optimizations.
In memory of my grandfather who passed away last month. He loved numbers and is probably why I do too...
I'm pleased to announce that "houkouonchi" (who wishes to remain anonymous) has set a new world record for the digits of Pi with 13,300,000,000,000 digits.
The computation took 208 days and was done using y-cruncher v0.6.3 on a workstation with the following specs:
Verification using the BBP formula was done by myself and took 182 hours on a Core i7 920 @ 3.5 GHz.
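For readers unfamiliar with how a BBP-type formula can check digits without recomputing everything before them, here is a toy sketch of hexadecimal digit extraction using the classic Bailey-Borwein-Plouffe series. It uses double-precision floats, so it is only good for a handful of digits at modest offsets. It is not the code that was used for the verification, and the function names are made up.

```python
def bbp_series(j, n):
    """Fractional part of the sum over k of 16^(n-k) / (8k + j)."""
    total = 0.0
    # Terms with k <= n: reduce 16^(n-k) modulo the denominator to keep things small.
    for k in range(n + 1):
        denom = 8 * k + j
        total = (total + pow(16, n - k, denom) / denom) % 1.0
    # Terms with k > n: these shrink by 16x each step, so only a few matter.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0

def pi_hex_digits(n, count=8):
    """Hex digits of Pi starting at position n + 1 after the point."""
    x = (4 * bbp_series(1, n) - 2 * bbp_series(4, n)
         - bbp_series(5, n) - bbp_series(6, n)) % 1.0
    digits = []
    for _ in range(count):
        x *= 16
        d = int(x)
        digits.append("0123456789ABCDEF"[d])
        x -= d
    return "".join(digits)

print(pi_hex_digits(0, 6))
```

The key property is that the cost depends on the position `n` but not on having the earlier digits. That makes it possible to spot-check the hexadecimal digits at the end of a large computation independently of the main algorithm.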
Overall, this computation was slower than Shigeru Kondo's 12.1 trillion because the machine had less disk bandwidth and was not dedicated to the task.
More details coming soon...
For now, the digits can be downloaded here*: http://fios.houkouonchi.jp:8080/pi/
You can contact houkouonchi at: firstname.lastname@example.org
*In order to view and/or decompress the digits, you will need the Digit Viewer. It comes bundled with y-cruncher.
It took way too long, but support for AVX2 has been added. The new binary targets Haswell processors and requires AVX2, FMA3, and BMI2 instructions.
Theoretically, it should also be able to run on AMD Excavator processors.
As a word of warning: On Haswell, the AVX2 binary runs considerably hotter than with just AVX. So please take care when running it (with or without overclock).
This is especially the case with all the thermal problems that Haswell has.
Oh hey look at the date! The long promised (and overdue) version for AMD processors is finally done. The "x64 XOP ~ Miyu" binary is optimized for AMD processors and uses FMA4 and XOP instructions. It will not run on Intel processors.
AVX2 is next on the list. But progress has been severely hindered by numerous issues with the Visual Studio compiler. VS2012 has severe bugs in its AVX2 code generation. VS2013 has a 10 - 30% performance regression in AVX code generation. Both versions generate terrible FMA3 code.
Long story short, expect the next version of y-cruncher to see the return of the Intel Compiler...
While doing some performance tuning on the XOP binary for v0.6.4, I discovered a nasty bug. The bug was introduced in v0.6.3 and also affects the SSE3 binary.
So a patch has been released along with a couple other minor bugs that were discovered after v0.6.3 was released.
It has been sitting on my desk for quite a few months now. And now it's out. It adds the Lemniscate Constant and brings back the Euler-Mascheroni Constant.
There have also been a number of refactorings and re-tunings since v0.6.2. So expect some slight differences in performance (both up and down).
As some of you already know, AVX is slower than SSE for AMD processors. The reason for this is explained here.
Unfortunately, y-cruncher is no exception: The performance of the AVX binary is much worse than that of the SSE3 and SSE4.1 binaries. Therefore, the dispatcher for v0.6.3 has been reconfigured to fall back to SSE3 for all AMD processors even if they support AVX. It falls all the way back to SSE3 instead of SSE4.1 because the SSE4.1 binaries are tuned specifically for Intel processors and don't run as well on AMD processors.
A new binary (x64 XOP ~ Miyu) will be coming out in v0.6.4 that is specifically tuned for AMD Bulldozer and Piledriver. It will use SSE4.1, FMA4, and XOP instructions.
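The fallback rule for v0.6.3 can be sketched roughly as follows. This is a hypothetical illustration, not the actual dispatcher: the function name, the feature labels, and the "baseline" fallback are all made up for the example. Only the CPUID vendor strings are real.

```python
def pick_binary(vendor, features):
    """Pick which binary to launch from the CPU vendor string and feature flags."""
    features = set(features)
    if vendor == "AuthenticAMD":
        # AMD: skip AVX (slower than SSE there) and SSE4.1 (tuned for Intel).
        return "SSE3" if "sse3" in features else "baseline"
    # Everyone else: take the newest instruction set that is supported.
    if "avx" in features:
        return "AVX"
    if "sse4_1" in features:
        return "SSE4.1"
    if "sse3" in features:
        return "SSE3"
    return "baseline"

print(pick_binary("AuthenticAMD", ["sse3", "sse4_1", "avx"]))
print(pick_binary("GenuineIntel", ["sse3", "sse4_1", "avx"]))
```

With the same feature flags, the AMD processor gets the SSE3 binary while the Intel one gets AVX. Once the XOP binary exists, a rule like this would presumably gain an extra AMD branch that checks for FMA4 and XOP first.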
The 10 trillion digit record had been standing for 2 years and it didn't look like anybody was trying to beat it. So we threw y-cruncher v0.6.3 at it along with some new hardware. More details here.
y-cruncher v0.6.3 will be released in a few days. Still needs a bit more testing...
Using a beta version of y-cruncher v0.6.3, it took 50 days to compute and 38 days to verify a computation of 119,377,958,182 digits of the Euler-Mascheroni Constant.
This is by far the longest computation I've ever attempted by myself using my own hardware. The main computation was interrupted multiple times due to overheating problems and a blown out power supply. After replacing the power supply and reseating the heatsinks, there were no more hardware issues. So the verification was able to run from start to finish in a single contiguous run lasting 38 days.
This is also the last long-running computation that will ever run on my aged workstation. Afterwards, the machine will be retired. I'll still be keeping it around, but I will no longer be running anything stressful on it.
y-cruncher's swap mode got a lot more complicated in v0.6.x. The lack of documentation also made it a lot harder to use.
I've finally gotten around to writing a user guide for y-cruncher's swap mode functionality.
In the future, I may add more of these for other features of y-cruncher.
Stuff happens... :(
I was originally gonna release v0.6.2 (fix 1) back in August, but some of the code-refactoring that I did had touched a bit too much of the program. So I didn't feel it was stable enough for a public release. (Shigeru Kondo knows this pretty well after I sent him some broken binaries. :P)
So v0.6.2 (fix 1) will be skipped and everything will be pushed into v0.6.3.
In addition to everything that was supposed to be in v0.6.2 (fix 1), v0.6.3 will also have:
I don't have a timeline or a release date yet. There's still a lot of testing to be done and I have less free time than when I was still in school.
In the meantime, I've released the source code to the Digit Viewer on my GitHub.
This is the exact same source that will be used to compile the Digit Viewer binaries that will be released with v0.6.3.
Any C++ programmers out there? I've been toying around with a "tiny" Pi program that can do millions of digits of Pi. Feel free to play with it.
It isn't very fast, but it hits all the necessary algorithms to get quasi-linear run-time.
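To give a flavor of what such a program involves, here is a separate minimal sketch (not the C++ program mentioned above) of the Chudnovsky series evaluated with binary splitting. It leans on Python's built-in big integers, whose multiplication is not quasi-linear, so it only demonstrates the series-summation part. The function names are made up.

```python
from math import isqrt

def chudnovsky_bs(a, b):
    """Binary splitting over terms a..b-1 of the Chudnovsky series. Returns (P, Q, T)."""
    if b - a == 1:
        if a == 0:
            p = q = 1
        else:
            p = (6 * a - 5) * (2 * a - 1) * (6 * a - 1)
            q = a * a * a * 10939058860032000  # 640320^3 / 24
        t = p * (13591409 + 545140134 * a)
        if a & 1:
            t = -t  # the series alternates in sign
        return p, q, t
    m = (a + b) // 2
    p1, q1, t1 = chudnovsky_bs(a, m)
    p2, q2, t2 = chudnovsky_bs(m, b)
    return p1 * p2, q1 * q2, q2 * t1 + p1 * t2

def pi_digits(digits):
    """Return Pi as a string truncated to `digits` decimal places."""
    guard = 10                       # extra digits to absorb rounding
    terms = digits // 14 + 2         # each term adds a bit over 14 digits
    _, q, t = chudnovsky_bs(0, terms)
    scale = 10 ** (digits + guard)
    sqrt_c = isqrt(10005 * scale * scale)  # sqrt(10005) as a scaled integer
    value = (q * 426880 * sqrt_c) // t
    s = str(value // 10 ** guard)
    return s[0] + "." + s[1:]

print(pi_digits(50))
```

Binary splitting keeps the operands balanced so that most of the work lands in a few very large multiplications. That is where a fast (FFT-based) multiplication would be needed to get the overall run-time down to quasi-linear.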
I've found and fixed the problem with "O_DIRECT" on Linux. Getting that to work cut the CPU usage in half. While it's a decent improvement, it wasn't as good as I had expected. And after trying out numerous tweaks, that last chunk of CPU usage from the I/O threads won't disappear. So I'll let that rest.
The fix that solves the "O_DIRECT" issue will be rolled into the next patch: v0.6.2 (fix 1)
After finally getting everything to work on Linux, I did some tests and noticed something that really bothered me: Large swap computations on Linux are significantly slower than on Windows.
Take a look:
Notice the difference. Same computer, same settings, everything is the same except for the OS.
After digging around I was finally able to trace the issue. It turns out that the I/O operations on Linux were using a lot of CPU. And I mean a LOT - as in half a core per hard drive. (8 hard drives = 4 cores) WTF?!?!
Why does this matter? Because the program overlaps disk I/O and computation. If disk I/O is using a lot of CPU, then the computation threads will be denied a lot of CPU time. In the test case above, the disk I/O threads hogged half of the 8 cores! On Windows, the disk I/O uses close to nothing and the computation threads get nearly all of the 8 cores to grind at.
On Windows, I use "FILE_FLAG_NO_BUFFERING" to DMA all the I/O operations and bypass the OS cache. So there is no overhead - and almost no CPU usage.
Likewise, on Linux, I use "O_DIRECT" to achieve the same thing.
However, it seems that the "O_DIRECT" flag has no effect. The performance is the same with or without it. Furthermore, it seems that I can pass in misaligned buffers and sizes. So in other words, the flag isn't working. If it was, it should fail on the misaligned parameters.
Until I can figure out what's preventing "O_DIRECT" from working, the Linux version will not perform as well as the Windows version.
This issue has probably existed since v0.5.x, but I never did any serious benchmarks on Linux until now.
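The alignment contract described above can be sketched like this. It is a minimal illustration, not y-cruncher's I/O code: the 4096-byte block size is an assumption (the real requirement depends on the device and filesystem), the function names are made up, and `direct_read` only works on Linux filesystems that support "O_DIRECT".

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real value depends on the device and filesystem

def is_aligned(offset, size, block=BLOCK):
    """True if both the file offset and the transfer size are multiples of block."""
    return offset % block == 0 and size % block == 0

def direct_read(path, offset, size):
    """Read `size` bytes at `offset` while bypassing the OS cache (Linux only)."""
    if size <= 0 or not is_aligned(offset, size):
        raise ValueError("O_DIRECT transfers should be block-aligned")
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, size)  # anonymous mappings start on a page boundary
        try:
            got = os.preadv(fd, [buf], offset)
            return bytes(buf[:got])
        finally:
            buf.close()
    finally:
        os.close(fd)

print(is_aligned(0, 8192), is_aligned(512, 8192), is_aligned(0, 1000))
```

When the flag is really in effect, handing the kernel a misaligned buffer, offset, or size typically fails with EINVAL instead of silently succeeding. That makes a deliberately misaligned read a quick way to test whether "O_DIRECT" is being honored.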
Other things: I still plan to open-source the digit viewer. But I need to clean up the code first. It's pretty unreadable to anyone other than me ATM.
There were older entries. But I no longer have a record of them... :(