Catalan's Constant is No Longer a Slow Constant: (August 23, 2018)
In what is probably the first major mathematical improvement since the start of the y-cruncher project, there are 3 new formulas for Catalan's Constant.
I was recently made aware of a publication by F. M. S. Lima which directly and indirectly led to 3 new formulas for Catalan's Constant which are faster than the Lupas formula which had been the fastest for more than a decade.
|Name||Formula||Speedup vs. Lupas|
All 3 formulas have been implemented for y-cruncher v0.7.7. But the timing of this with respect to y-cruncher's release cycle (v0.7.6 having just been released) means it will be months before these formulas see the light of day in y-cruncher. But at the very least, we know that they exist and that they work. All 3 implementations have been tested to 1 billion digits (which didn't take very long).
Catalan's Constant has historically been one of the slowest mainstream constants to compute - second only to the Euler-Mascheroni Constant. But thanks to these formulas, Catalan's Constant is now almost as fast as Zeta(3).
As only two independent algorithms are needed to establish records, the existing formulas by Lupas and Huvent (as well as the new Lima-Guillera formula) are now effectively out-of-date. But there are no immediate plans to remove any of the outdated formulas.
Mathematics evolves at a much slower pace than computer hardware. So it is easy to forget that math remains one of the most important parts of high precision computations. Without such formulas and algorithms, large computations would be impossible. So we must never stop thanking the generations of mathematicians, both past and present, that have made such projects possible.
While we wait for v0.7.7, two new articles have been added to the algorithms page:
Version 0.7.6: (August 11, 2018)
A new version is out. This release is mostly a dump of bug fixes, internal refactorings, and secondary features. So not a whole lot on the outside.
The only major feature is the "18-CNL" binary for Cannon Lake processors. Due to the lack of suitable hardware from the delays in Intel's 10nm process, it will probably remain untuned and not fully tested until at least 2020. The Core i3 8121U systems that currently exist are too difficult to obtain and are likely "too small" to do the tuning that y-cruncher needs.
Spectre Attack via AVX Clock Speed Throttle: (August 7, 2018)
Somewhat off-topic, but not entirely unrelated since AVX is a common theme here.
Apparently, the AVX clock speed throttle present in Intel processors is yet another Spectre attack vector. This is something that I've been thinking about on-and-off since January, but the conceptual breakthrough didn't happen until June.
Unbeknownst to me, NetSpectre, a closely related exploit, was already in the works and was publicized in July while my thread with Intel security was still ongoing.
The contents of the blog are now out-of-date in the context of NetSpectre. But I'll leave it as is - the way I intended to publish it back in June.
Feel free to reach out to me about this topic on Twitter or something. While I'm not a researcher in this area, I do have some other ideas for Spectre attacks and enhancements which I'm not really interested in pursuing. So it would be neat if someone else can turn them into working exploits.
AVX512 Scalability - One Year In: (July 2, 2018)
When AVX512 (for consumer hardware) launched a year ago with Skylake X, few applications supported it. y-cruncher was one of those few, but the performance was very disappointing due to all sorts of hardware issues and unexpected performance bottlenecks.
Over the past year, work was done to target those bottlenecks. And while that work failed to eliminate them, it still improved performance by a lot. So now that there's been enough time to properly optimize and tune for AVX512, we can finally do a fair evaluation of this instruction set for bignum crunching.
The table below shows scalability across two dimensions: instruction set (AVX2 -> AVX512) and parallelism. This processor has 10 cores hyperthreaded to 20 threads. The clock speed is normalized to 3.6 GHz for all workloads. This is admittedly unrealistic since AVX512 will usually run at a lower frequency. But it does provide a perfect clock-for-clock comparison between AVX2 and AVX512.
1 Billion Digits of Pi - (Times in Seconds)
Core i9 7900X @ 3.6 GHz - 4 x DDR4 @ 3000 MT/s
|y-cruncher v0.7.3 (July 2017)||y-cruncher v0.7.6 (ETA 2018)|
|14-BDW (AVX2)||17-SKX (AVX512)||Speedup||14-BDW (AVX2)||17-SKX (AVX512)||Speedup|
|1 thread||473.317||331.888||1.43 x||439.967||272.042||1.62 x|
|20 threads||48.658||39.862||1.22 x||41.833||30.645||1.37 x|
|Speedup||9.73 x||8.33 x|| ||10.50 x||8.88 x|| |
The single-threaded benchmarks show the raw speed of AVX2 vs. AVX512. While everything got faster from v0.7.3 to v0.7.6, the AVX512 improved more. This is because v0.7.6 has all the new optimizations that can only be done with real AVX512 hardware. In contrast, v0.7.3 was released only 2 weeks after Skylake X was launched. Much of the AVX512 in v0.7.3 was written years ago using only emulators and before the hardware ever existed in silicon.
The multi-threaded runs show smaller speedups from AVX2 to AVX512. This is due to memory bandwidth becoming a factor. Even with 4 channels of high-clocked memory, Skylake X with more than a few cores does not have enough memory bandwidth to feed all the cores. Many of the optimizations between v0.7.3 and v0.7.6 were targeted at alleviating this bottleneck.
So as of 2018, the raw AVX2 -> AVX512 speedup is 62% on Skylake X with dual 512-bit FMAs. Given the amount of code that remains unvectorized, 62% is reasonable due to Amdahl's Law. Furthermore, AVX512 only doubles up the floating-point capability. Many integer operations fall well short of that.
But once everything else is factored in (AVX512 clock speed throttle and memory bandwidth with multi-threading), the real speedup of AVX512 on a large Skylake X processor with a lot of cores dwindles to around 20 - 30%. Disappointing? Yes. But not surprising for new technology. Perhaps things will be better when DDR5 becomes a thing.
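As a quick sanity check, the fixed-clock speedup columns in the table above are just ratios of the listed times. The sketch below recomputes the AVX2 -> AVX512 ratios in Python. The dictionary layout and the function name `avx512_gain` are made up for this illustration and are not part of y-cruncher.

```python
# Times in seconds, copied from the 1-billion-digit table above
# (Core i9 7900X @ 3.6 GHz). Keys are thread counts.
times = {
    "v0.7.3": {"AVX2": {1: 473.317, 20: 48.658},
               "AVX512": {1: 331.888, 20: 39.862}},
    "v0.7.6": {"AVX2": {1: 439.967, 20: 41.833},
               "AVX512": {1: 272.042, 20: 30.645}},
}


def avx512_gain(version, threads):
    """Return the AVX2 time divided by the AVX512 time (hypothetical helper)."""
    t = times[version]
    return t["AVX2"][threads] / t["AVX512"][threads]


for version in times:
    for threads in (1, 20):
        gain = avx512_gain(version, threads)
        print(f"{version}, {threads:>2} thread(s): {gain:.2f}x")
```

The single-threaded v0.7.6 ratio is the 62% figure quoted above. The 20 - 30% real-world estimate additionally depends on the AVX512 clock speed throttle, which this fixed-clock data does not capture.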
Maybe it's too early to start talking about Cannon Lake since it will still be a long time before they hit the market in volume.
250 Million Digits of Pi - (Times in Seconds)
Core i3-8121U @ unknown fixed clock speed*
|y-cruncher v0.7.6 (ETA 2018)|
|14-BDW (AVX2)||17-SKX (AVX512-DQ)||18-CNL (AVX512-VBMI)|
The Core i3-8121U is a 10nm Cannon Lake processor with only one 512-bit FMA. Having only one FMA severely limits the performance of the baseline AVX512. But the new binary seems to care less. It's way too early to draw any conclusions yet.
*Credit to Jzw for testing this.
Pi Day and "houkouonchi": (March 14, 2018)
For those who have been following the Pi computation world records, you'll know that "houkouonchi" is obviously a pseudonym. Back in 2014 when he set the Pi world record with 13.3 trillion digits, he asked me not to reveal his real name. His reason: He didn't want to be bothered by people contacting him through his Facebook and personal email.
However, houkouonchi sadly passed away in 2015.
Being an internet contact, I didn't find out about it for almost a year. Furthermore, I had no contact information for his family members.
For the past 2 years, I've been torn on whether or not to reveal his real name. On one hand, he asked me not to reveal his name. But on the other hand, I felt a strong desire to put his name on his world record. Without any contact information, I've been unable to reach out to his family. And nobody is watching his email as my messages have remained unanswered.
In the end, I decided that his original reason for being anonymous is no longer applicable. Therefore I will now put a name to the record of 13.3 trillion digits.
His name is Sandon Van Ness. Rest in peace my friend.
Page Table Isolation and Large Pages: (January 7, 2018)
If you follow tech news, you should be well aware of the Meltdown and Spectre side-channel attacks that affect nearly all processors with speculative execution. Furthermore, the patches come with performance penalties that range anywhere from negligible to a ridiculous 50% depending on the application and hardware.
Is y-cruncher affected? Yes. But it may be avoidable under certain circumstances.
The following table compares performance with and without KPTI for Meltdown. Unfortunately, no BIOS/microcode update for the Spectre patch was available to test. Given the age of the system, it seems unlikely that the manufacturer will provide such an update.
|1 billion digits of Pi||y-cruncher v0.7.4|
|Normal Pages (4 KB)||Large Pages (2 MB)|
|Kernel Page Table Isolation (KPTI)||110.418||106.388|
y-cruncher spends very little time in the kernel. So based on that, one would expect the effect of KPTI to be negligible. However, there are a lot of small system calls from all the multi-threading related constructs. (mutexes, condition variables, signals, etc...)
In the end, we see a 3% performance impact when using normal (4 KB) pages. But when switching to large (2 MB) pages, that penalty disappears.
A possible explanation for this is that each system call that goes into kernel mode will cause a TLB flush upon its return. So even if the system call is short, it leads to a flood of TLB misses as the computation resumes and has to re-populate the TLB. Since y-cruncher has a massive memory footprint, there will be a lot of pages to re-populate. With large pages, there are far fewer of them - thereby drastically reducing the performance penalty. Though this explanation has issues since PCID should (theoretically) be eliminating the TLB flushes as far as I understand (which I admit I don't).
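To put rough numbers on the "far fewer pages" part of this explanation, the sketch below counts how many pages are needed to map a working set with 4 KB versus 2 MB pages. The 4 GiB working set is a hypothetical figure chosen for illustration, not a measurement of y-cruncher's actual footprint, and the helper name `pages_needed` is invented.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(working_set_bytes, page_bytes):
    """Number of pages required to cover the working set (ceiling division)."""
    return -(-working_set_bytes // page_bytes)


# Hypothetical working set; the real footprint depends on the computation size.
working_set = 4 * GIB

small_pages = pages_needed(working_set, 4 * KIB)
large_pages = pages_needed(working_set, 2 * MIB)

print(f"4 KB pages: {small_pages}")
print(f"2 MB pages: {large_pages}")
print(f"ratio:      {small_pages // large_pages}x fewer translations to reload")
```

Whatever the true footprint is, the ratio between the two page sizes is fixed at 512, so any cost that scales with the number of TLB entries to re-populate shrinks accordingly.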
Regardless of the exact reason for why large pages help so much, let's not get too excited. This is just a single benchmark on a single platform. Things may look different on other systems. Furthermore, there are requirements to enable large pages - some of which may be inconvenient.
Those interested in testing out large pages for y-cruncher can refer to the memory allocation guide.
Looking forward, the current development version of v0.7.5 is showing significantly less penalty from KPTI:
|1 billion digits of Pi||y-cruncher v0.7.5 (trunk)|
|Normal Pages (4 KB)||Large Pages (2 MB)|
|Kernel Page Table Isolation (KPTI)||104.581||103.142|
It's unclear why this is the case. But it could be a side-effect of the new bandwidth optimizations.
Version v0.7.5 is currently not ready for release. However, it is in feature freeze.
So far, I have yet to test the impact of the Spectre mitigations.
AV False Positives: (October 24, 2017)
It has come to my attention that the "y-cruncher.exe" launcher binary in Windows has been getting flagged by some virus scanners. This has been the case for at least the last several releases.
This is currently under investigation. Unfortunately, I'm not an expert in this field. So I don't really know what exactly is tripping up the AV heuristics. At this time, the only fix I've found will break compatibility with older versions of Windows.
In case anybody is willing to help, the source code for the launcher binary has been released on GitHub. Compiling it using the provided Visual Studio project will produce the binary that turns up as a false positive under said virus scanners.
Version 0.7.4: (October 14, 2017)
A new version is out with some changes aimed at addressing the memory bottleneck on Ryzen 7 and Skylake X.
While the changes in this release help alleviate the memory bottleneck, they fall well short of actually solving it. The memory bandwidth problem is here to stay and is expected to get worse if hardware trends continue as they have been for the past decade.
The memory bandwidth problem has actually been an ongoing cat-and-mouse game since the very first version of y-cruncher. Whenever the memory bottleneck catches up in a new generation of hardware, an optimization in y-cruncher is made to push it away again. But it looks like the cat is finally winning on the latest generation of desktop processors - especially the HCC Skylake X line.
Updates on Skylake X and Threadripper: (August 15, 2017)
This is a follow up to my analysis on the Skylake X processors as well as a few notes about AMD Threadripper.
I've also released a patch (v0.7.3.9474) to fix some issues with the Linux binaries on Threadripper and Epyc.
Skylake X Follow Up:
It has been confirmed with benchmarks that all the Skylake X desktops have full-throughput FMA. This directly contradicts Intel's pre-release information. Subsequently, all the pre-release reviewers apparently got their information from the same inaccurate source from Intel. So if you are looking to purchase a Skylake X system for the purpose of AVX512, you do not need to spend $1000 to get the Core i9 7900X for the full AVX512. Either of the 6 or 8 core models (7800X and 7820X) will do.
The phantom throttling issues on Gigabyte motherboards still exist even after multiple BIOS updates. While there has been some change in the behavior of the throttling, little has been done to actually solve it. So manual intervention is still necessary to counter the throttling. The problem of throttling in general seems to be even worse in Linux. But I have yet to investigate this in depth. And as of now, I don't even know if the throttling in Linux is a phantom throttle or a normal throttle.
I look forward to seeing what phantom throttling looks like under the VTune profiler.
The "unknown bottlenecks" that I mentioned before turn out to be the L3 cache mesh. The L3 cache mesh on Skylake X only has about half the bandwidth of the L3 cache on the previous generation Haswell/Broadwell-EP processors. The Skylake X L3 cache is so slow that it's barely faster than main memory in terms of bandwidth. So for all practical purposes, it's as good as non-existent.
To illustrate how bad the L3 mesh is, this is what happened when I overclocked it:
1 billion digits of Pi - Core i9 7900X @ 3.8 GHz
y-cruncher v0.7.3 - Times in Seconds
|AVX2 (14-BDW)||AVX512 (17-SKX)|
|Memory\Mesh||2.4 GHz||2.8 GHz||3.2 GHz||2.4 GHz||2.8 GHz||3.2 GHz|
That's a 25% speedup just from overclocking the cache and memory while keeping the CPU frequency locked at 3.8 GHz. And we still haven't reached the point of diminishing returns. Above 3.2 GHz cache and 3400 MHz memory, the system started showing signs of instability. But I didn't try very hard to get it stable.
Overclocking the cache and memory also led to a disproportionally large increase in CPU temperatures and power consumption. This is probably due to the secondary effects of lifting the cache/memory bottlenecks which allow the code to run much more efficiently and intensively than before.
Running the L3 cache mesh at 3.2 GHz managed to trigger a small but noticeable amount of phantom throttling in a BIOS profile which I thought was resistant to it. So I actually had to redo all the benchmarks with an updated profile. In my previous phantom throttling test with the BBP benchmark, the throttling only affected the AVX512 workload. In this case, it affected both the AVX2 and AVX512 loads - but more so on the AVX512. The AVX2 runs never broke 220W on the CPU. But the AVX512 runs were consistently pulling around 270 - 320W. Prior to this, I had never observed any throttling below 260W. Assuming these power draw readings are reliable, it seems to suggest that the phantom throttling isn't entirely dependent on the total CPU power draw. So there are likely other (unknown) factors involved.
From the software optimization standpoint, the cache bottleneck brings a new set of difficulties. The L2 cache is fine. It is 4x larger than before and has doubled in bandwidth to keep up with the AVX512. But the L3 is useless. The net effect is that the usable cache per core is halved compared to the previous Haswell/Broadwell generations. Furthermore, doubling of the SIMD size with AVX512 makes the usable cache 4x smaller than before in terms of # of SIMD words that fit in cache.
The Skylake Purley binary (17-SKX) for y-cruncher v0.7.3 is currently tuned to 1 MB cache/logical-core. But since the L3 is useless, it should be 512 KB/logical-core. However, fixing this isn't as simple as changing one number in the source code and recompiling. The effect of the cache being "4x smaller" unfortunately puts it outside the domain of y-cruncher's tuning parameters. Fixing this to allow y-cruncher to run well on such a small cache will probably require uprooting a not-so-insignificant amount of code.
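The "4x smaller" effect is simple arithmetic: halve the usable cache and double the SIMD width, and only a quarter as many SIMD words fit. The sketch below works this through using the 1 MB and 512 KB per-logical-core figures from the tuning discussion as stand-ins for "before" and "after". That pairing is an assumption made for illustration, and `simd_words` is an invented helper.

```python
KIB = 1024


def simd_words(cache_bytes, simd_bits):
    """How many SIMD words of the given width fit in a cache of the given size."""
    return cache_bytes // (simd_bits // 8)


# Assumed "before": 1 MB of usable cache per logical core with 256-bit AVX2 words.
before = simd_words(1024 * KIB, 256)

# Assumed "after": 512 KB of usable cache per logical core with 512-bit AVX512 words.
after = simd_words(512 * KIB, 512)

print(f"before: {before} words, after: {after} words, ratio: {before // after}x")
```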
AMD Threadripper and NUMA:
Threadripper has been released and has brought 16-core processors to the consumer market. And a lot of the talk has been about NUMA, the various memory modes, and how they affect performance.
I currently don't have a Threadripper machine to play with nor do I intend to buy/build one (I'm way over-budget on hardware this year). But as far as I can tell from the reviews, it's no different from multi-socket NUMA which has already existed for years:
y-cruncher has been NUMA-aware since v0.7.1. And starting from v0.7.3, it has been smart about allocating memory on NUMA systems.
y-cruncher prefers high bandwidth and is relatively insensitive to memory latencies. So node-interleaving is almost always better.
So in other words, it should not matter which memory mode you choose on Threadripper and Epyc since y-cruncher will automatically do the right thing. But if you wish to tinker with the memory allocation settings within y-cruncher, you can do that within the "Custom Compute" menu. (Further reading: Memory Allocation)
To summarize, no major update is needed for y-cruncher to support Threadripper's NUMA modes. Only a patch was needed to make it properly detect the NUMA topology on Threadripper and Epyc under Linux. The Windows binaries should be fine without the patch.
This has been a crazy year and it's not done yet as we're expecting the high-core-count Skylake X chips to arrive in September.
Overall, y-cruncher has actually been better prepared for Ryzen/Threadripper than Skylake X/AVX512. AMD Zen didn't bring anything new in terms of processor features. So little needed to be done on the low-level optimization side. The NUMA stuff is something that y-cruncher already supported since multi-socket servers have always been one of y-cruncher's intended use cases.
The only unexpected bottleneck with Zen was the memory bandwidth - one that was largely unactionable.
On the other hand, Skylake X brings AVX512 which led to a massive domino effect of bottlenecks and performance issues which I was completely unprepared for.
Skylake X and AVX512: (July 6, 2017)
Let's talk about Skylake X and AVX512, because everyone's been waiting for this. There's currently a lack of AVX512 benchmarks and stress tests, and because of that, I've had at least half a dozen people and organizations contact me about y-cruncher's AVX512.
Okay... some AVX512 benchmarks already existed. SiSoftware Sandra had some support. And my little-known FLOPs benchmark did too. But people either weren't aware of them, or wanted more. And by advertising y-cruncher's internal AVX512 support for at least a year now, I basically brought this on myself.
So let's get to the point. Unfortunately, AVX512 will not bring the "instant massive performance gain" that a lot of people were expecting. Realistically speaking, the speedups over AVX2 seem to vary around 10 - 50% - usually on the lower end of that scale. While the investigation is on-going, there are some known factors:
Not all Skylake X and Skylake Purley processors will have the full AVX512 capability:
While this reason doesn't apply to my system, it's worth mentioning it anyway.
Architecturally, Skylake X retains Skylake desktop's architecture with 2 x 256-bit FMA units. In Skylake X, those two 256-bit FMA units can merge to form a single 512-bit FMA. On the processors with full-throughput AVX512, there is also a dedicated 512-bit FMA - thereby providing 2 x 512-bit FMA capability.
However, that dedicated 512-bit FMA is only enabled on the Core i9 parts. The 6-core and 8-core Core i7 parts are supposed to have it disabled. Therefore they only have half the AVX512 performance.
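The "half the AVX512 performance" statement follows from counting FMA lanes. The sketch below computes the theoretical peak double-precision FLOPs per cycle per core for the three configurations described above. It only models FMA issue width and ignores clock speeds, memory, and everything else. The function name is invented for this example.

```python
def peak_dp_flops_per_cycle(fma_units, vector_bits):
    """Theoretical peak double-precision FLOPs per cycle for one core."""
    lanes = vector_bits // 64      # 64-bit doubles per vector
    return fma_units * lanes * 2   # each FMA counts as a multiply plus an add


configs = {
    "2 x 256-bit FMA (AVX2)": (2, 256),
    "1 x 512-bit FMA (fused ports only)": (1, 512),
    "2 x 512-bit FMA (fused + dedicated)": (2, 512),
}

for name, (units, bits) in configs.items():
    print(f"{name}: {peak_dp_flops_per_cycle(units, bits)} FLOPs/cycle")
```

With only the fused 512-bit unit, the peak is the same as 2 x 256-bit AVX2. The dedicated second unit is what doubles it.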
It's worth mentioning that there is a benchmark on an engineering-sample 6-core Core i7 that shows full-throughput AVX512 anyway. However, engineering sample processors are not always representative of the retail parts.
So as of this writing, I still don't know if the 6 and 8-core Skylake X Core i7's have the full AVX512. The only Skylake X processor I have at this time is the Core i9 7900X which is supposed to have the full AVX512 anyway. (and indeed it does based on my tests)
Update (July 14, 2017):
Carsten Spille from www.pcgameshardware.de has notified me that the retail Core i7 7800X does in fact have full-throughput AVX512. This goes against all the reviews that have repeatedly stated that only the Core i9s have the full AVX512. So far, Intel has not commented on this.
"Phantom throttling" of performance when certain thermal limits are exceeded:
Within minutes of getting my system set up, I started noticing inconsistencies in performance. And after spending a long Friday night investigating the issue, I determined that there was a sort of "phantom throttling" of AVX512 code when certain thermal limits are exceeded.
"Phantom throttling" is the term that I used to describe the problem in my emails with the Silicon Lottery vendor. And it looks like I'm not the only one using that term anymore. Phantom throttling is when the processor gets throttled without a change in clock frequency. For many years, processors have throttled down for many reasons to protect themselves from damage. But when throttling happens, it has always been done by lowering the clock frequency - which is visible in a monitor like CPUz. Skylake X is the first line of processors to break from this and it makes it more difficult to detect the throttling.
Right now, the phantom throttling phenomenon is still not well understood. Overclocker der8auer has mentioned that it could be caused by CPUz not reacting fast enough to actual clock frequency changes. On the other hand, the tests that Silicon Lottery and myself have done seem to show that there really is no drop in clock frequency at all.
Initially, I observed this effect only with AVX512 code and thus hypothesized that the mechanism behind the throttling is the shutdown of the dedicated 512-bit FMA. But others have found that phantom throttling also occurs on AVX and scalar code. In short, much more investigation is needed. The lack of AVX512 programs out there certainly doesn't help and is partially why I'm rushing this release of y-cruncher v0.7.3.
Currently, there are no known reliable ways of stopping the throttling and results vary heavily by motherboard manufacturer. But maxing out thermal limits and disabling all thermal protections seems to help. (Don't try this at home if you don't know what you're doing or you aren't at least moderately experienced in overclocking. You can destroy your processor and/or motherboard if you aren't careful.)
Update (July 9, 2017):
I got asked about this, so here's some data showing the phantom throttling at stock settings. The pink entries (the 10-thread and 20-thread AVX512 runs) are the ones with phantom throttling.
10 billion Hex-Digit of Pi - Plouffe's 4-term BBP Formula (y-cruncher v0.7.3)
Core i9 7900X - Gigabyte AORUS Gaming 7 (BIOS F7a)
All Stock Settings
|Binary:||AVX2 (14-BDW)||AVX512 (17-SKX)|
|Threads/Cores||Time (secs)||Clock Speed||Power||Max Temperature||Time (secs)||Clock Speed||Power||Max Temperature|
|1 thread/1 core||408.118||4.5 GHz||58 W||70°C||215.399||4.5 GHz||62 W||71°C|
|2 threads/2 cores||211.103||4.0 - 4.1 GHz||77 W||72°C||110.990||4.1 GHz||87 W||74°C|
|4 threads/4 cores||111.948||4.0 GHz||99 W||61°C||58.836||4.0 GHz||136 W||74°C|
|8 threads/8 cores||57.189||4.0 GHz||160 W||67°C||30.145||4.0 GHz||244 W||94°C|
|10 threads/10 cores||45.957||4.0 GHz||194 W||69°C||51.879||4.0 GHz||188 W||68°C|
|20 threads/10 cores||41.669||4.0 GHz||217 W||74°C||72.242||4.0 GHz||160 W||68°C|
And here's the same set of benchmarks with the throttling eliminated with the appropriate BIOS settings. (Thanks to the guys on Overclock.net for helping me here.) The two benchmarks which phantom throttled before are no longer phantom throttled. But instead, they run hot enough to hit temperature throttling which has a visible drop in frequency.
10 billion Hex-Digit of Pi - Plouffe's 4-term BBP Formula (y-cruncher v0.7.3)
Core i9 7900X - Gigabyte AORUS Gaming 7 (BIOS F7a)
Package Power Limit1/2 = 400 W
CPU VCore Loadline Calibration = Medium
CPU VCore Current Protection = High
AVX and AVX512 capped at 4.0 GHz (turbo set to flat 41x, AVX + AVX512 offsets set to 1x)
All other settings left at default.
|Binary:||AVX2 (14-BDW)||AVX512 (17-SKX)|
|Threads/Cores||Time (secs)||Clock Speed||Power||Max Temperature||Time (secs)||Clock Speed||Power||Max Temperature|
|1 thread/1 core||454.325||4.0 GHz||48 W||53°C||239.082||4.0 GHz||58 W||70°C|
|2 threads/2 cores||228.641||4.0 GHz||62 W||55°C||119.740||4.0 GHz||80 W||74°C|
|4 threads/4 cores||113.700||4.0 GHz||94 W||59°C||59.900||4.0 GHz||134 W||74°C|
|8 threads/8 cores||57.146||4.0 GHz||159 W||67°C||30.061||4.0 GHz||239 W||95°C|
|10 threads/10 cores||46.033||4.0 GHz||191 W||68°C||24.340||3.8 - 4.0 GHz||283 W||95°C|
|20 threads/10 cores||42.143||4.0 GHz||209 W||73°C||24.972||3.7 - 4.0 GHz||294 W||95°C|
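Since phantom throttling does not show up in the reported clock speed, one way to spot it is to compare run times against a known-good configuration. The sketch below does this for the AVX512 columns of the two tables above. The 1.5x threshold and the function name `flag_throttled` are arbitrary choices made for this illustration.

```python
# AVX512 (17-SKX) times in seconds, copied from the two tables above.
# Keys are (threads, cores).
stock = {
    (1, 1): 215.399, (2, 2): 110.990, (4, 4): 58.836,
    (8, 8): 30.145, (10, 10): 51.879, (20, 10): 72.242,
}
tuned = {
    (1, 1): 239.082, (2, 2): 119.740, (4, 4): 59.900,
    (8, 8): 30.061, (10, 10): 24.340, (20, 10): 24.972,
}


def flag_throttled(stock_times, tuned_times, threshold=1.5):
    """Return configs whose stock run is much slower than the tuned run.

    The threshold is a hypothetical cutoff, not a value from y-cruncher.
    """
    return [cfg for cfg in stock_times
            if stock_times[cfg] / tuned_times[cfg] > threshold]


for threads, cores in flag_throttled(stock, tuned):
    ratio = stock[(threads, cores)] / tuned[(threads, cores)]
    print(f"{threads} threads/{cores} cores: stock is {ratio:.2f}x slower")
```

Only the 10-thread and 20-thread runs are flagged. The other stock runs are as fast as or faster than the tuned ones, partly because the stock settings let lightly threaded runs turbo above 4.0 GHz.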
It's worth noting that there is something wrong here. At stock settings, the motherboard/BIOS is failing to apply the AVX/AVX512 offsets in most of the tests here. This allows all cores to run at 4.0 GHz under AVX512 which is causing the throttling. Furthermore, it allows individual cores to turbo up to 4.5 GHz under AVX512. In other words, the motherboard is overclocking the processor by default.
The problem with my chip is that the "weakest" core cannot run AVX512 @ 4.5 GHz at default voltages. Doing so will crash (BSOD) the system. Therefore, I had to manually cap the AVX and AVX512 clocks to 4.0 GHz.
While I've fixed this by manually setting the AVX/AVX512 offsets, I hope that a BIOS update will fix this for everyone else who hasn't (or doesn't know to) do this. Dropping the all-core AVX512 clock speed down to 3.6 GHz was enough to avoid all throttling with the default thermal limits.
Memory bandwidth is a significant bottleneck:
y-cruncher was already slightly memory-bound on Haswell-E. Now on Skylake X, it is much worse. While I had anticipated a memory bottleneck on Skylake X with AVX512, it seems that I've underestimated the severity of it:
(The CPU frequencies in this benchmark were chosen to be low enough to avoid any throttling or phantom throttling.)
1 billion digits of Pi - Core i9 7900X @ 3.8 GHz
Times in Seconds
|Threads||Memory Frequency||Instruction Set|
In the single threaded benchmarks, the memory frequency has less than 2% effect for both AVX2 and AVX512. Multi-threaded, that jumps to 9% and 15% respectively. This is much more than is expected for a program that used to be completely compute-bound just a few years ago.
Amdahl's law and other unknown scalability issues:
In a typical y-cruncher computation, only about 80% of the CPU time is spent running vectorized code when AVX2 is used. So by Amdahl's law, even if we get perfect scaling with the AVX512, we can only cut 40% off the run-time. Right now, the single-threaded benchmarks (which are least memory-bound) are only showing 27% speedup with AVX512 over AVX2.
This remaining 13% discrepancy is currently unresolved. Microbenchmarks of y-cruncher's AVX512 code show near perfect 2x speedups over AVX2. (Some show >2x thanks to the increased register count.) But this speedup seems to drop off as the data sizes increase - even while still fitting in cache. This seems to hint at unknown bottlenecks within the L2 and L3 caches. The fact that cache sizes haven't increased along with the wider SIMD also doesn't help.
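The 40% bound can be worked out directly from Amdahl's law. The sketch below assumes the figures stated above (80% of the AVX2 run-time is vectorized, and AVX512 speeds up that portion by exactly 2x) and computes the best-case remaining run-time. The function name is invented for this example.

```python
def amdahl_runtime(vector_fraction, vector_speedup):
    """Relative run-time after speeding up only the vectorized fraction."""
    return (1.0 - vector_fraction) + vector_fraction / vector_speedup


# 80% vectorized, perfect 2x from doubling the SIMD width.
remaining = amdahl_runtime(0.8, 2.0)

print(f"remaining run-time: {remaining:.0%}")
print(f"run-time cut:       {1.0 - remaining:.0%}")
print(f"overall speedup:    {1.0 / remaining:.2f}x")
```

The unvectorized 20% is untouched, so even an infinitely fast vector unit could never cut more than 80% of the run-time.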
For now, investigation is difficult because none of my performance profilers support Skylake X yet.
Implications for Stress-Testing:
y-cruncher's failure to achieve a decent speedup for AVX512 also means that it is unable to put a heavy load on the AVX512 computation units. Therefore it is not a great stress-test for Skylake X with full AVX512.
But there is one y-cruncher feature which seems to be unaffected - the BBP benchmark.
The BBP benchmark feature is contained entirely in cache and is thus free of the memory bottleneck. It is able to put a much higher stress than the stress-tester and the computations. So if you run the BBP benchmark (option 4) and set the offset to 100 billion, you can still put a pretty heavy load on your AVX512-capable processor.
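For readers unfamiliar with why a BBP computation fits in cache: the formula extracts hexadecimal digits of Pi at an arbitrary offset using only modular exponentiation and a handful of running sums, so it needs almost no memory. Below is a minimal Python sketch of digit extraction with the classic four-term BBP formula. It is a textbook floating-point version for small offsets, not y-cruncher's implementation, and the function names are invented.

```python
def bbp_series(j, n):
    """Fractional part of sum over k of 16^(n-k) / (8k + j)."""
    s = 0.0
    # Head: terms with k <= n, reduced with modular exponentiation.
    for k in range(n + 1):
        denom = 8 * k + j
        s = (s + pow(16, n - k, denom) / denom) % 1.0
    # Tail: terms with k > n shrink by 16x each step.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        s += term
        k += 1
    return s % 1.0


def pi_hex_digits(offset, count=8):
    """Hex digits of Pi starting `offset` places after the hexadecimal point.

    Double precision limits this sketch to roughly 8 reliable digits per call.
    """
    x = (4 * bbp_series(1, offset) - 2 * bbp_series(4, offset)
         - bbp_series(5, offset) - bbp_series(6, offset)) % 1.0
    digits = []
    for _ in range(count):
        x *= 16
        d = int(x)
        digits.append("0123456789ABCDEF"[d])
        x -= d
    return "".join(digits)


print(pi_hex_digits(0))  # Pi = 3.243F6A88... in hexadecimal
```

The work grows with the offset while the memory use stays constant, which is consistent with a large offset producing a long, compute-heavy run.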
A future version of y-cruncher will revamp the stress-tester to incorporate the BBP benchmark as well as other possible improvements.
Version 0.7.2 and AMD Zen: (March 14, 2017)
I went through a lot of trouble to do this in time for Pi day, but here it is. y-cruncher v0.7.2 has a new binary specifically optimized for AMD's Ryzen 7 processors.
The performance gain is about 5% over the Broadwell-tuned binary and 15% over v0.7.1. It turns out that the optimizations between v0.7.1 and v0.7.2 happened to be more favorable to AMD Zen than to Intel processors. Nevertheless, this is not enough to make Ryzen beat Haswell-E or Broadwell-E.
It's unlikely that any amount of Zen-specific optimizations can make Ryzen beat Haswell/Broadwell-E. The difference in memory bandwidth and 256-bit AVX throughput is simply far too large to overcome. AMD made a conscious decision to sacrifice HPC to focus on mainstream.
As for the Ryzen platform itself: It's a bit immature at this point. I went out on launch day to grab the Zen parts. In the end, it took me 3 sets of memory and 2 weeks before I finally found a stable configuration that I could use. From what I've seen on Reddit and various forums, I've been unlucky, but I'm definitely not alone.
Slightly more concerning is a system freeze with FMA instructions which appears to have been confirmed by AMD as a processor erratum. Fortunately, the source also says this is fixable via a microcode update. So it won't lead to something catastrophic like a recall or a fix that disables processor features.
As for the Zen architecture itself. Here are my (early) observations:
For software developers, compiling code on the 1800X is about as fast as the 5960X at stock clocks. But the 5960X has much more overclocking headroom, so it ends up winning by around 15%. For a $500 processor, the R7 1800X is very impressive.
Pi Computed to 22.4 Trillion Digits: (November 15, 2016)
I woke up this morning to see what was quite possibly one of the bigger surprises I've ever seen. Peter Trueb, who had previously set records for the Lemniscate and Euler-Mascheroni Constants, had sent me an email with details for a fully verified computation of Pi to 22.4 trillion digits.
The exact number of digits is 22,459,157,718,361 which is precisely 10^12 * Pi^e rounded down. This smashes the previous record of 13.3 trillion digits set by "houkouonchi" back in 2014. The computation took 105 days from July to November. It was interrupted 3 times, but otherwise went through without any major issues.
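The digit count can be reproduced by evaluating Pi raised to the power e, scaling by 10^12, and truncating. The sketch below uses Python's standard `decimal` module at 50 digits of precision. The hard-coded value of Pi and the variable names are choices made for this illustration.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # far more precision than the 14 digits needed here

PI = Decimal("3.14159265358979323846264338327950288419716939937510")
E = Decimal(1).exp()

# Pi^e computed as exp(e * ln(Pi)).
pi_to_the_e = (E * PI.ln()).exp()

# Scale by 10^12 and round down.
record_digits = int(pi_to_the_e * 10**12)

print(f"{record_digits:,}")
```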
The hardware that was used was:
The 3.5 month run-time for 22 trillion digits is quite remarkable. Even though there have been several years of hardware and software improvements since the previous records, computations of this size have generally stagnated due to the inability of disk storage to keep up with Moore's Law in both size and performance.
Other notable and interesting facts:
On the software side, this is the first Pi record in 2 years. Since then, y-cruncher has gone through many changes from multiple refactorings, AVX2, the new parallel computing frameworks, new implementations of the large FFT algorithms, etc... - none of which had ever been tested at such large sizes. So this computation can be seen as somewhat of a validation of 2 years of work.
This is the first time that y-cruncher has been used to set a Pi record completely without my knowledge. In the past, I've always been made aware of the computations in order to provide technical support. But this time, everything from the computation to the necessary verification steps was done entirely by Peter Trueb and his sponsors. I took no part in it at all other than to maintain this website along with all the downloads and documentation.
Knights Landing Xeon Phi with AVX512: (October 10, 2016) - permalink
After more than 2 years of waiting, y-cruncher with AVX512 has finally been tested on native hardware. David Carver was kind enough to test drive an internal version of y-cruncher v0.7.1 which has the AVX512-CD binary enabled. Here it is compared to some more conventional machines:
|Processors:||Core i7 5960X||2 x Xeon E5-2696 v4||Xeon Phi 7250|
|Processor Speed:||4.0 GHz (OC)||2.2 GHz||1.4 GHz|
|Binary:||AVX2||AVX2 + ADX||AVX2 + ADX||AVX512-CD|
The AVX512-CD binary uses AVX512 Foundation and Conflict-Detection instructions. It has been in development since early 2014, but has never been run on native hardware until now. Now it has been confirmed to work well enough to do a Pi benchmark.
Performance-wise, Knights Landing falls short of the highest-end Haswell-E and Broadwell-E systems. Furthermore, the AVX2 -> AVX512 scaling is a lackluster 34%. For now, the reason remains unknown. But it's currently hypothesized to be either memory bandwidth or Amdahl's Law.
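To get a feel for what the Amdahl's Law hypothesis would imply, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the vectorized part of the work would speed up by an ideal 2x when going from 256-bit AVX2 to 512-bit AVX512. That factor is an assumption, not a measurement, and the calculation says nothing about whether memory bandwidth is the real culprit.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the run-time is accelerated by s."""
    return 1.0 / ((1.0 - p) + p / s)


def implied_fraction(observed, s):
    """Invert Amdahl's Law: the fraction p that would explain an observed speedup."""
    return (1.0 - 1.0 / observed) / (1.0 - 1.0 / s)


OBSERVED = 1.34    # the 34% AVX2 -> AVX512 scaling quoted above
IDEAL_GAIN = 2.0   # assumed ideal gain from doubling the vector width

p = implied_fraction(OBSERVED, IDEAL_GAIN)
print(f"fraction of run-time that would need to scale: {p:.1%}")
print(f"check: {amdahl_speedup(p, IDEAL_GAIN):.2f}x")
```

Under that assumption, only about half of the run-time would be benefiting from the wider vectors.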
It's worth noting that y-cruncher is completely untuned for the Knights Landing architecture. Nearly all optimizations and tuning settings are the same as the desktop chips. So there's likely more performance left to be squeezed out. But due to the cost of Xeon Phi systems along with the general inaccessibility to consumers, it will be a while before y-cruncher has any properly tuned binaries for Knights Landing (if ever).
The AVX512-CD binary (for both Windows and Linux) is available upon request to anyone who sends me a Knights Landing benchmark. But for now, I'm hesitant to formally release it since it hasn't been sufficiently tested. (A pi benchmark has very poor test-coverage of the entire program.)
In addition to the AVX512-CD binary, y-cruncher also has AVX512-DQ and AVX512-IFMA binaries for Skylake Purley and Cannonlake. But assuming Intel sticks with its policy of massive delays, it will be quite a while before either of them sees the light of day.
y-cruncher v0.7.1: (May 16, 2016) - permalink
This is a semi-unplanned release to address a number of critical issues with the HWBOT integration. (Most notably the reference clock skew issue.)
Other than that, there are few other user-visible features. Most of the changes since v0.6.9 are internal refactorings. Some of these were large (and dangerous) enough that it probably would've been better to wait a few more months before releasing v0.7.1. So if anything breaks, let me know.
While this version wasn't intended to have many new features, all that refactoring did lend itself to some opportunistic stuff such as large pages and Unicode support.
GUI Benchmark Wrapper and HWBOT Integration: (April 3, 2016) - permalink
I get asked these two questions a lot:
#1 never happened because I suck at UI programming and I didn't want that mixed in with performance critical code.
#2 never happened because the HWBOT benchmark API wasn't ready.
Well, both are finally happening... More details here: http://forum.hwbot.org/showthread.php?t=155079
Pi Day and some Spin-off Projects: (March 14, 2016) - permalink
Anyone who has been following my GitHub profile for the past year will know that I've been working on a library that exposes the compute-engine of y-cruncher. Well that's finally done and pushed out the door. (It was actually completed in January, but I waited until now following my usual "wait several months for Q/A".)
In any case, the spin-off project consists of two components:
YMP stands for "y-cruncher Multi-Precision Library". For the most part, it's just another bignum library - except that it supports SIMD and parallelized large multiplication.
Number Factory is largely a test app for the YMP library. It implements much of the same functionality as y-cruncher, albeit more cleanly and less efficiently.
The two can be found on my GitHub: https://github.com/Mysticial/NumberFactory
Documentation for the library can be found here: http://www.numberworld.org/ymp/v1.0/
For now, the project is entirely experimental and is available only for 64-bit Windows with Visual Studio 2015. It is far from mature and there are no plans to support Linux in the near future. But at the very least, it will let people code things up that utilize y-cruncher's parallel large multiplication.
Version 0.6.9: (December 5, 2015) - permalink
Intel seems to be taking its fine time with the AVX512 stuff. So I guess that's not happening any time soon...
While that endless wait continues, there are some scalability improvements that have been sitting on my desk for months:
*The Linux versions of y-cruncher have historically been statically linked as a means to avoid the DLL dependency hell on Linux. Unfortunately, Intel does not provide a static library for Cilk Plus - thereby forcing dynamic linking. After fiddling with this for multiple weekends, I can't get anything that will run on just my 3 Linux boxes.
I give up. For now, y-cruncher for Linux will be available in both static and dynamic versions. The dynamic version will target a recent version of Ubuntu and will support Cilk Plus. The static version will run almost anywhere as before, but it lacks Cilk Plus.
Performance Announcement for Ubuntu 15.10: (November 8, 2015) - permalink
This applies to Ubuntu 15.10, but may also apply to other Linux distributions with the same kernel version.
When performing swap mode computations on Ubuntu 15.10, the OS has been observed to do excessive swap file access when y-cruncher is performing heavy I/O. This swapping is so severe that the OS becomes unresponsive and the computation stalls. It has not been observed in Ubuntu 15.04.
So far, the only solution that seems to work reliably is to completely disable the swap partition. Then reduce the amount of memory that y-cruncher should use so that you don't run out of memory. Another possible solution is to set the "swappiness" value to zero. But this is untested.
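For anyone who wants to check the current setting before experimenting, here is a small sketch that reads the value Linux exposes at /proc/sys/vm/swappiness. The function name read_swappiness is made up for this example. Changing the value requires root (for instance with "sysctl vm.swappiness=0"), and as noted above, whether that actually helps here is untested.

```python
from pathlib import Path

# Standard location of the setting on Linux.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current swappiness as an int, or None if it can't be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Not Linux, no permission, or unexpected file contents.
        return None


value = read_swappiness()
if value is None:
    print("swappiness is not available on this system")
else:
    print(f"vm.swappiness = {value}")
```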
The next version (v0.6.9) will have a lower default memory setting for swap mode.
200 billion digits of Catalan's Constant: (June 8, 2015) - permalink
I'm pleased to announce that after running for more than half a year, Robert Setti has computed Catalan's Constant to 200 billion digits.
This is very impressive because Catalan's Constant is one of the slowest to compute. (among popular constants that can be computed in quasi-linear time)
Version 0.6.8 (fix 2): (May 7, 2015) - permalink
A lot of unexpected personal stuff happened this last month. I'll be starting a new job next week that is potentially much more stressful than ever before.
So I've decided to push out all the remaining bugfixes for v0.6.8. Depending on how things turn out, this may very well be the last version until the Skylake Xeon.
Version v0.6.8 Patched: (March 17, 2015) - permalink
I totally knew this would happen... The moment I rush a release (for Pi Day), something breaks.
It turns out there was a very large (up to 5%) performance regression on Haswell processors that scales inversely with the memory bandwidth of the system. Normally, such large regressions are caught long before they can be released. But since my primary test machine has 4 channels of overclocked DDR4, I never noticed the regression at all. It was only after Shigeru Kondo reported this that I tested it on a different Haswell machine, which did reproduce the regression.
This screw-up involved a cache optimization that was designed into the algorithm from the very beginning rather than being added later as a result of profiling. Being a premature optimization, it backfired for small computations and had no noticeable effect for large ones on my machine. Therefore I disabled it sometime between v0.6.7 and v0.6.8. Not a big deal, I expect some of these premature micro-optimizations to backfire.
Well... It turns out that the premature optimization wasn't really that premature after all. On large computations, it reduces the demand on memory bandwidth. If the processor is bandwidth-starved (such as mainstream Haswell), it translates to an increase in performance. Therefore, this patch re-enables the optimization on all processors except for AMD Bulldozer - which takes a 10% performance hit for some unknown reason.
The sad part is that, back in 2012, I did this optimization because I predicted that it would reduce the demand on memory bandwidth. But it wasn't until 2015 that the hardware and software became fast enough for it to actually make a difference. During those 3 years, I completely forgot about the optimization and left it in the "on" position until a recent refactoring touched it. That prompted me to re-evaluate it and (erroneously) disable it for the Pi Day v0.6.8 release.
Pi Day of the Century: (March 14, 2015) - permalink
Surprise! There was no way I could possibly pass this day up, right?
y-cruncher v0.6.8 is out with some new features:
Version 0.6.7 Released: (February 8, 2015) - permalink
It turns out that v0.6.6 had yet another serious bug that would cause a large multiplication to fail under the right circumstances. But at least this time, it wasn't related to swap mode. So after fixing that and cherrypicking it into v0.6.7 (along with 3 other things), I think we're good to go.
About that unstable workstation...
While the system still isn't entirely stable in Linux yet, it's in good enough shape to do longer running stuff.
This instability turned out to be a good opportunity to test the program's never-used RAID3 implementation. The hard drive configuration was 2 sets of 8 drives each in RAID3. One computation tolerated 9 hard drive read errors and still managed to finish correctly.
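To illustrate why a parity-based layout like RAID3 can ride through a read error, here is a minimal sketch of XOR parity. It is not y-cruncher's implementation. The 7-data-plus-1-parity split and the tiny 4-byte blocks are assumptions made for the example, and all function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Append a parity block so that the XOR of the whole stripe is zero."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, failed_index):
    """Rebuild one unreadable block from all the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# 7 hypothetical data drives holding 4 bytes each, plus 1 parity drive.
data = [bytes([i] * 4) for i in range(1, 8)]
stripe = make_stripe(data)

# Pretend drive 3 returned a read error and rebuild its block.
rebuilt = recover(stripe, 3)
print(rebuilt == data[3])
```

One parity block per stripe can only cover a single failed block in that stripe, so surviving multiple errors relies on them not hitting the same stripe of the same set at once.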
This entire instability mess has prompted me to update the user guide for Swap Mode with a new section.
Version 0.6.7 Preview: (January 27, 2015) - permalink
Version 0.6.7 has been built and is undergoing final testing. But I have no idea how long that will take. While everything looks good on Windows, testing on Linux is currently blocked while I diagnose an instability on my storage workstation with Linux.
The likely source of the instability is a massive hardware upgrade in December. Since Windows is fine and Linux is unstable, I suspect it's a driver issue. But I have yet to sort it out. The fact that I'm not much of a Linux person isn't really helping the situation.
In any case, part of that hardware upgrade involved adding 8 hard drives to the machine for a total of 16 drives. So v0.6.7 consists of mostly swap mode improvements that I decided to do after playing around with this 16 hard drive toy.
The main feature of v0.6.7 is a swap-mode multiplication tester which has two purposes:
The second point is important for anyone attempting world record computations. As the sole developer of y-cruncher, I only have the resources to test large multiplications up to around 5 trillion digits. Which means that I cannot reach the sizes that are now required to set Pi world records.
In the past, there have been bugs in the multiplication which only manifest at sizes that nobody has ever reached before. The scenario that I want to avoid is for someone else to spend months attempting a world record only to fail because of a bug in my code. In a sense, it's somewhat miraculous that y-cruncher is 4 for 4 in world record attempts for Pi. (i.e. no fatal software bugs)
Version 0.6.6 Patched (again): (December 21, 2014) - permalink
Version 0.6.6 is turning into one of the worst versions of y-cruncher. Once again, I've found some serious bugs that need to be fixed asap.
Because of the severity of the regressions that have been fixed, I highly recommend everyone who uses swap mode to upgrade to this patch (v0.6.6.9452).
In particular, the error-detection in v0.6.6 prior to this patch is so badly broken that it's as good as useless.
There's an on-going task to refactor the entire program from C -> idiomatic(ish) C++11 for better long-term maintainability. In v0.6.6, this refactor touched some of the fragile swap mode code - which is very difficult to test because it is slow, resource-intensive, and full of corner cases.
Version 0.6.6: (November 5, 2014) - permalink
This release is a bit premature since it hasn't been tested much yet. But it fixes a lot of bugs including one that causes all large Pi benchmarks to fail on the 6-core Haswell processors. So ready or not, it needs to be released. In addition to bug-fixes, v0.6.6 adds a couple of new features and some minor optimizations.
In memory of my grandfather who passed away last month. He loved numbers and is probably why I do too...
World Record - 13.3 trillion digits of Pi: (October 8, 2014) - permalink
I'm pleased to announce that "houkouonchi" (who wishes to remain anonymous) has set a new world record for the digits of Pi with 13,300,000,000,000 digits.
The computation took 208 days and was done using y-cruncher v0.6.3 on a workstation with the following specs:
Verification using the BBP formula was done by myself and took 182 hours on a Core i7 920 @ 3.5 GHz.
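For readers unfamiliar with how a BBP-type formula can check digits without recomputing everything before them, here is a toy sketch of hexadecimal digit extraction. It uses plain double-precision floats, so it is only trustworthy for small positions, and it is in no way the verification code that was used for the record. The function names are made up for this example.

```python
def _series(j, n):
    """Fractional part of the sum over k of 16^(n-k) / (8k + j)."""
    s = 0.0
    # Terms with k <= n: reduce 16^(n-k) modulo the denominator first.
    for k in range(n + 1):
        d = 8 * k + j
        s = (s + pow(16, n - k, d) / d) % 1.0
    # Terms with k > n shrink by a factor of 16 each step.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        s += term
        k += 1
    return s % 1.0


def pi_hex_digit(n):
    """Hex digit of Pi at 0-based position n after the point."""
    x = (4 * _series(1, n) - 2 * _series(4, n)
         - _series(5, n) - _series(6, n)) % 1.0
    return "0123456789ABCDEF"[int(16 * x)]


first_digits = "".join(pi_hex_digit(n) for n in range(8))
print(first_digits)  # Pi = 3.243F6A88... in hexadecimal
```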
Overall, this computation was slower than Shigeru Kondo's 12.1 trillion because the machine had less disk bandwidth and was not dedicated to the task.
More details coming soon...
For now, the digits can be downloaded here*: http://fios.houkouonchi.jp:8080/pi/
You can contact houkouonchi at: email@example.com
*In order to view and/or decompress the digits, you will need the Digit Viewer. It comes bundled with y-cruncher.
Version 0.6.5: (May 26, 2014) - permalink
It took way too long, but support for AVX2 has been added. The new binary targets Haswell processors and requires AVX2, FMA3, and BMI2 instructions.
Theoretically, it should also be able to run on AMD Excavator processors.
As a word of warning: On Haswell, the AVX2 binary runs considerably hotter than with just AVX. So please take care when running it (with or without overclock).
This is especially the case with all the thermal problems that Haswell has.
Pi Day and Version 0.6.4: (March 14, 2014) - permalink
Oh hey look at the date! The long promised (and overdue) version for AMD processors is finally done. The "x64 XOP ~ Miyu" binary is optimized for AMD processors and uses FMA4 and XOP instructions. It will not run on Intel processors.
AVX2 is next on the list. But progress has been severely hindered by numerous issues with the Visual Studio compiler. VS2012 has severe bugs in its AVX2 code generation. VS2013 has a 10 - 30% performance regression in AVX code generation. Both versions generate terrible FMA3 code.
Long story short, expect the next version of y-cruncher to see the return of the Intel Compiler...
Version 0.6.3 Patched: (February 21, 2014) - permalink
While doing some performance tuning on the XOP binary for v0.6.4, I discovered a nasty bug. The bug was introduced in v0.6.3 and also affects the SSE3 binary.
So a patch has been released along with a couple other minor bugs that were discovered after v0.6.3 was released.
Version 0.6.3 is out: (December 29, 2013) - permalink
It had been sitting on my desk for quite a few months now. And now it's out. It adds the Lemniscate Constant and brings back the Euler-Mascheroni Constant.
There have also been a number of refactorings and re-tunings since v0.6.2. So expect some slight differences in performance (both up and down).
As some of you already know, AVX is slower than SSE for AMD processors. The reason for this is explained here.
Unfortunately, y-cruncher is no exception: The performance of the AVX binary is much worse than that of the SSE3 and SSE4.1 binaries. Therefore, the dispatcher for v0.6.3 has been reconfigured to fall back to SSE3 for all AMD processors even if they support AVX. It falls all the way back to SSE3 instead of SSE4.1 because the SSE4.1 binaries are tuned specifically for Intel processors and don't run as well on AMD processors.
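The fallback rule described above can be sketched roughly as follows. This is only an illustration of the policy, not the actual dispatcher: the function name, the vendor strings, and the "best supported instruction set" rule for non-AMD processors are all assumptions.

```python
# Preference order assumed for non-AMD processors, best first.
PREFERENCE = ["AVX", "SSE4.1", "SSE3"]


def choose_binary(vendor, features):
    """Pick an instruction-set target for a processor.

    vendor   -- hypothetical vendor string, e.g. "AMD" or "Intel"
    features -- set of supported instruction sets, e.g. {"SSE3", "AVX"}
    """
    if vendor == "AMD":
        # Per the v0.6.3 policy: skip AVX and SSE4.1 even when supported.
        return "SSE3" if "SSE3" in features else None
    for isa in PREFERENCE:
        if isa in features:
            return isa
    return None  # none of the targets modeled here is supported


print(choose_binary("AMD", {"SSE3", "SSE4.1", "AVX"}))
print(choose_binary("Intel", {"SSE3", "SSE4.1", "AVX"}))
```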
A new binary (x64 XOP ~ Miyu) will be coming out in v0.6.4 that is specifically tuned for AMD Bulldozer and Piledriver. It will use SSE4.1, FMA4, and XOP instructions.
12.1 Trillion Digits of Pi: (December 28, 2013) - permalink
The 10 trillion digit record had been standing for 2 years and it didn't look like anybody was trying to beat it. So we threw y-cruncher v0.6.3 at it along with some new hardware. More details here.
y-cruncher v0.6.3 will be released in a few days. Still needs a bit more testing...
119 billion digits of Euler's Constant: (December 22, 2013) - permalink
Using a beta version of y-cruncher v0.6.3, it took 50 days to compute and 38 days to verify a computation of 119,377,958,182 digits of the Euler-Mascheroni Constant.
This is by far the longest computation I've ever attempted by myself using my own hardware. The main computation was interrupted multiple times due to overheating problems and a blown out power supply. After replacing the power supply and reseating the heatsinks, there were no more hardware issues. So the verification was able to run from start to finish in a single contiguous run lasting 38 days.
This is also the last long-running computation that will ever run on my aged workstation. Afterwards, the machine will be retired. I'll still be keeping it around, but I will no longer be running anything stressful on it.
User Guide for v0.6.x's Swap Mode: (November 24, 2013) - permalink
y-cruncher's swap mode got a lot more complicated in v0.6.x. The lack of documentation also made it a lot harder to use.
I've finally gotten around to writing a user guide for y-cruncher's swap mode functionality.
In the future, I may add more of these for other features of y-cruncher.
Version 0.6.2 (fix 1) Canceled - Digit Viewer Source Released: (October 9, 2013) - permalink
Stuff happens... :(
I was originally gonna release v0.6.2 (fix 1) back in August, but some of the code-refactoring that I did had touched a bit too much of the program. So I didn't feel it was stable enough for a public release. (Shigeru Kondo knows this pretty well after I sent him some broken binaries. :P)
So v0.6.2 (fix 1) will be skipped and everything will be pushed into v0.6.3.
In addition to everything that was supposed to be in v0.6.2 (fix 1), v0.6.3 will also have:
I don't have a timeline or a release date yet. There's still a lot of testing to be done and I have less free time than when I was still in school.
In the meantime, I've released the source code to the Digit Viewer on my GitHub.
This is the exact same source that will be used to compile the Digit Viewer binaries that will be released with v0.6.3.
Some random things... (July 14, 2013) - permalink
Any C++ programmers out there? I've been toying around with a "tiny" Pi program that can do millions of digits of Pi. Feel free to play with it.
It isn't very fast, but it hits all the necessary algorithms to get quasi-linear run-time.
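For a taste of the kind of algorithm involved, here is a short Python sketch of the Chudnovsky series evaluated with binary splitting, which is one common way to structure such a computation. It is not the "tiny" C++ program mentioned above, and the names are invented for this example. Note also that Python's built-in big integers do not use FFT-based multiplication, so this sketch by itself is not quasi-linear.

```python
from math import isqrt

C3_OVER_24 = 640320**3 // 24  # exact, since 640320^3 is divisible by 24


def _split(a, b):
    """Binary splitting over series terms a <= k < b; returns (P, Q, T)."""
    if b - a == 1:
        if a == 0:
            p = q = 1
        else:
            p = (6 * a - 5) * (2 * a - 1) * (6 * a - 1)
            q = a * a * a * C3_OVER_24
        t = p * (13591409 + 545140134 * a)
        if a & 1:
            t = -t  # the series alternates in sign
        return p, q, t
    m = (a + b) // 2
    p1, q1, t1 = _split(a, m)
    p2, q2, t2 = _split(m, b)
    return p1 * p2, q1 * q2, q2 * t1 + p1 * t2


def pi_digits(n):
    """Return floor(Pi * 10^n) as an integer."""
    guard = 10                      # extra digits to absorb truncation error
    work = n + guard
    terms = work // 14 + 1          # each term adds roughly 14 digits
    _, q, t = _split(0, terms)
    sqrt_c = isqrt(10005 * 10 ** (2 * work))
    return (q * 426880 * sqrt_c) // t // 10**guard


print(pi_digits(50))
```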
I've found and fixed the problem with "O_DIRECT" on Linux. Getting that to work cut the CPU usage in half. While it's a decent improvement, it wasn't as good as I had expected. And after trying out numerous tweaks, that last chunk of CPU usage from the I/O threads won't disappear. So I'll let that rest.
The fix that solves the "O_DIRECT" issue will be rolled into the next patch: v0.6.2 (fix 1)
v0.6.2 for Linux (June 30, 2013) - permalink
After finally getting everything to work on Linux, I did some tests and noticed something that really bothered me: Large swap computations on Linux are significantly slower than on Windows.
Take a look:
Notice the difference. Same computer, same settings, everything is the same except for the OS.
After digging around I was finally able to trace the issue. It turns out that the I/O operations on Linux were using a lot of CPU. And I mean a LOT - as in half a core per hard drive. (8 hard drives = 4 cores) WTF?!?!
Why does this matter? Because the program overlaps disk I/O and computation. If disk I/O is using a lot of CPU, then the computation threads will be denied a lot of CPU time. In the test case above, the disk I/O threads hogged half of the 8 cores! On Windows, the disk I/O uses close to nothing and the computation threads get nearly all of the 8 cores to grind at.
On Windows, I use "FILE_FLAG_NO_BUFFERING" to DMA all the I/O operations and bypass the OS cache. So there is no overhead - and almost no CPU usage.
Likewise, on Linux, I use "O_DIRECT" to achieve the same thing.
However, it seems that the "O_DIRECT" flag has no effect. The performance is the same with or without it. Furthermore, it seems that I can pass in misaligned buffers and sizes. So in other words, the flag isn't working. If it was, it should fail on the misaligned parameters.
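The "does it reject misaligned parameters?" test can be sketched like this. It is a hedged illustration in Python rather than the program's own I/O code: the 4096-byte block size is an assumption (the real alignment requirement depends on the device and filesystem), the function names are hypothetical, and some filesystems accept or silently ignore the flag, so the result is a hint rather than proof.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed alignment for buffers, offsets, and lengths


def is_aligned(value, block=BLOCK):
    """True if an offset, length, or address is a multiple of the block size."""
    return value % block == 0


def direct_read_rejects_misalignment(path, block=BLOCK):
    """Attempt a deliberately misaligned O_DIRECT read of an existing file.

    Returns True if the kernel rejects it with EINVAL (direct I/O looks active),
    False if the read goes through (the flag apparently has no effect), and
    None if O_DIRECT is unavailable or the file can't be opened with it.
    """
    flag = getattr(os, "O_DIRECT", None)  # only defined on some platforms
    if flag is None:
        return None
    try:
        fd = os.open(path, os.O_RDONLY | flag)
    except OSError:
        return None
    try:
        buf = mmap.mmap(-1, 2 * block)        # anonymous maps are page-aligned
        view = memoryview(buf)[1:1 + block]   # shift by 1 byte to misalign
        try:
            os.readv(fd, [view])
            return False
        except OSError as exc:
            if exc.errno == errno.EINVAL:
                return True
            raise
    finally:
        os.close(fd)
```

Calling direct_read_rejects_misalignment on a large file that lives on the drive in question gives a quick indication of whether the flag is taking effect.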
Until I can figure out what's preventing "O_DIRECT" from working, the Linux version will not perform as well as the Windows version.
This issue has probably existed since v0.5.x, but I never did any serious benchmarks on Linux until now.
Other things: I still plan to open-source the digit viewer. But I need to clean up the code first. It's pretty unreadable to anyone other than me ATM.
There were older entries. But I no longer have a record of them... :(