y-cruncher - A Multi-Threaded Pi-Program
From a high-school project that went a little too far...
By Alexander J. Yee
(Last updated: September 17, 2017)
The first scalable multi-threaded Pi-benchmark for multi-core systems...
How fast can your computer compute Pi?
y-cruncher is a program that can compute Pi and other constants to trillions of digits.
It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.
y-cruncher has been used to set several world records for the most digits of Pi ever computed.
Windows: Version 0.7.3 Build 9475 (Released: September 14, 2017)
Linux: Version 0.7.3 Build 9475 (Released: September 14, 2017)
Official HWBOT thread.
Official XtremeSystems Forums thread.
Updates on Skylake X and Threadripper: (August 15, 2017)
This is a follow-up to my analysis of the Skylake X processors, as well as a few notes about AMD Threadripper.
I've also released a patch (v0.7.3.9474) to fix some issues with the Linux binaries on Threadripper and Epyc.
Skylake X Follow Up:
It has been confirmed with benchmarks that all the Skylake X desktops have full-throughput FMA. This directly contradicts Intel's pre-release information. Subsequently, all the pre-release reviewers apparently got their information from the same inaccurate source from Intel. So if you are looking to purchase a Skylake X system for the purpose of AVX512, you do not need to spend $1000 to get the Core i9 7900X for the full AVX512. Either of the 6 or 8 core models (7800X and 7820X) will do.
The phantom throttling issues on Gigabyte motherboards still exist even after multiple BIOS updates. While there has been some change in the behavior of the throttling, little has been done to actually solve it. So manual intervention is still necessary to counter the throttling. The problem of throttling in general seems to be even worse in Linux. But I have yet to investigate this in depth. And as of now, I don't even know if the throttling in Linux is a phantom throttle or a normal throttle.
I look forward to seeing what phantom throttling looks like under the VTune profiler.
The "unknown bottleneck" that I mentioned before is the L3 cache mesh. The L3 cache mesh on Skylake X only has about half the bandwidth of the L3 cache on the previous generation Haswell/Broadwell-EP processors. The Skylake X L3 cache is so slow that it's barely faster than main memory in terms of bandwidth. So for all practical purposes, it's as good as non-existent.
To illustrate how bad the L3 mesh is, this is what happened when I overclocked it:
1 billion digits of Pi - Core i9 7900X @ 3.8 GHz
y-cruncher v0.7.3 - Times in Seconds
|AVX2 (14-BDW)||AVX512 (17-SKX)|
|Memory\Mesh||2.4 GHz||2.8 GHz||3.2 GHz||2.4 GHz||2.8 GHz||3.2 GHz|
That's a 25% speedup just from overclocking the cache and memory while keeping the CPU frequency locked at 3.8 GHz. And we still haven't reached the point of diminishing returns. Above 3.2 GHz cache and 3400 MHz memory, the system started showing signs of instability. But I didn't try very hard to get it stable.
Overclocking the cache and memory also led to a disproportionally large increase in CPU temperatures and power consumption. This is probably due to the secondary effects of lifting the cache/memory bottlenecks which allow the code to run much more efficiently and intensively than before.
Running the L3 cache mesh at 3.2 GHz managed to trigger a small but noticeable amount of phantom throttling in a BIOS profile which I thought was resistant to it. So I actually had to redo all the benchmarks with an updated profile. In my previous phantom throttling test with the BBP benchmark, the throttling only affected the AVX512 workload. In this case, it affected both the AVX2 and AVX512 loads - but more so on the AVX512. The AVX2 runs never broke 220W on the CPU. But the AVX512 runs were consistently pulling around 270 - 320W. Prior to this, I had never observed any throttling below 260W. Assuming these power draw readings are reliable, it seems to suggest that the phantom throttling isn't entirely dependent on the total CPU power draw. So there are likely other (unknown) factors involved.
From the software optimization standpoint, the cache bottleneck brings a new set of difficulties. The L2 cache is fine. It is 4x larger than before and has doubled in bandwidth to keep up with the AVX512. But the L3 is useless. The net effect is that the usable cache per core is halved compared to the previous Haswell/Broadwell generations. Furthermore, doubling of the SIMD size with AVX512 makes the usable cache 4x smaller than before in terms of # of SIMD words that fit in cache.
The Skylake Purley binary (17-SKX) for y-cruncher v0.7.3 is currently tuned to 1 MB cache/logical-core. But since the L3 is useless, it should be 512 KB/logical-core. However, fixing this isn't as simple as changing one number in the source code and recompiling. The effect of the cache being "4x smaller" unfortunately puts it outside the domain of y-cruncher's tuning parameters. Fixing this to allow y-cruncher to run well on such a small cache will probably require uprooting a not-so-insignificant amount of code.
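The "4x smaller" figure can be checked with some quick arithmetic. The sketch below is only illustrative: it assumes the "before" case is 1 MB of usable cache per logical core with 256-bit AVX2 registers, and the "after" case is 512 KB per logical core with 512-bit AVX512 registers. The helper name `simd_words` is made up for this example and is not part of y-cruncher.

```python
# Illustrative arithmetic only. The "before" configuration is an assumption.
KIB = 1024

def simd_words(cache_bytes: int, simd_bits: int) -> int:
    """Number of full SIMD registers' worth of data that fit in the cache."""
    return cache_bytes // (simd_bits // 8)

before = simd_words(1024 * KIB, 256)  # assumed: 1 MB per logical core, AVX2
after = simd_words(512 * KIB, 512)    # 512 KB per logical core, AVX512

print(before, after, before // after)
```

Halving the usable cache and doubling the SIMD width each cost a factor of two, which is where the combined factor of four comes from.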
AMD Threadripper and NUMA:
Threadripper has been released and has brought 16-core processors to the consumer market. A lot of the talk has been about NUMA, the various memory modes, and how they affect performance.
I currently don't have a Threadripper machine to play with nor do I intend to buy/build one (I'm way over-budget on hardware this year). But as far as I can tell from the reviews, it's no different from multi-socket NUMA which has already existed for years:
y-cruncher has been NUMA-aware since v0.7.1. And starting from v0.7.3, it has been smart about allocating memory on NUMA systems.
y-cruncher prefers high bandwidth and is relatively insensitive to memory latencies. So node-interleaving is almost always better.
So in other words, it should not matter which memory mode you choose on Threadripper and Epyc since y-cruncher will automatically do the right thing. But if you wish to tinker with the memory allocation settings within y-cruncher, you can do that within the "Custom Compute" menu. (Further reading: Memory Allocation)
To summarize, no major update is needed for y-cruncher to support Threadripper's NUMA modes. Only a patch was needed to make it properly detect the NUMA topology on Threadripper and Epyc under Linux. The Windows binaries should be fine without the patch.
This has been a crazy year and it's not done yet as we're expecting the high-core-count Skylake X chips to arrive in September.
Overall, y-cruncher has actually been better prepared for Ryzen/Threadripper than Skylake X/AVX512. AMD Zen didn't bring anything new in terms of processor features. So little needed to be done on the low-level optimization side. The NUMA stuff is something that y-cruncher already supported since multi-socket servers have always been one of y-cruncher's intended use cases.
The only unexpected bottleneck with Zen was the memory bandwidth - one that was largely unactionable.
On the other hand, Skylake X brings AVX512 which led to a massive domino effect of bottlenecks and performance issues which I was completely unprepared for.
y-cruncher has been used to set a number of world-record-size computations.
Blue: Current World Record
Green: Former World Record
Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.
|Date Announced||Date Completed:||Source:||Who:||Constant:||Decimal Digits:||Time:||Computer:|
|August 24, 2017||August 23, 2017||Ron Watkins||Euler-Mascheroni Constant||477,511,832,674||4 x Xeon E5-4660 v3 @ 2.1 GHz - 1 TB
2 x Xeon X5690 @ 3.47 GHz - 128 GB
|August 14, 2017||August 13, 2017||Ron Watkins||Zeta(3) - Apery's Constant||500,000,000,000||
8 x Xeon 6550 @ 2.0 GHz - 512 GB
2 x Xeon X5690 @ 3.46 GHz - 142 GB
|November 15, 2016||November 11, 2016||Blog
|Peter Trueb||Pi||22,459,157,718,361||Compute: 105 days||4 x Xeon E7-8890 v3 @ 2.50 GHz
1.25 TB DDR4
20 x 6 TB 7200 RPM Seagate
|September 3, 2016||August 29, 2016||Ron Watkins||e||5,000,000,000,000||2 x Xeon X5690 @ 3.47 GHz
|July 11, 2016||July 5, 2016||"yoyo"||Golden Ratio||10,000,000,000,000||
|2 x Intel Xeon E5-2696 v4 @ 2.2 GHz
|June 28, 2016||June 19, 2016||Ron Watkins||Square Root of 2||10,000,000,000,000||2 x Xeon X5690 @ 3.47 GHz
|June 4, 2016||May 29, 2016||Ron Watkins||Lemniscate||250,000,000,000||4 x Xeon E5-4660 v3 @ 2.1 GHz - 1TB
4 x Xeon X6550 @ 2 GHz - 512 GB
|June 4, 2016||June 2, 2016||"yoyo"||Golden Ratio||5,000,000,000,000||
|2 x Intel Xeon E5-2696 v4 @ 2.2 GHz
|April 24, 2016||April 18, 2016||Ron Watkins||Log(2)||500,000,000,000||4 x Xeon X5690 @ 3.47 GHz - 141 GB|
|April 17, 2016||April 12, 2016||Ron Watkins||Catalan's Constant||250,000,000,000||4 x Xeon E5-4660 v3 @ 2.1 GHz
|April 9, 2016||April 3, 2016||Ron Watkins||Log(10)||500,000,000,000||2 x Xeon X5690 @ 3.47 GHz
|February 8, 2016||February 6, 2016||Mike A||Catalan's Constant||500,000,000,000||
|2 x Intel Xeon E5-2697 v3 @ 2.6 GHz
|July 24, 2015||July 22, 2015
July 23, 2015
|Golden Ratio||2,000,000,000,000||4 x Xeon X6550 @ 2 GHz - 512 GB
Xeon E5-2676 v3 @ 2.4 GHz - 64 GB
|October 8, 2014||October 7, 2014||"houkouonchi"||Pi||13,300,000,000,000||2 x Xeon E5-4650L @ 2.6 GHz
192 GB DDR3 @ 1333 MHz
24 x 4 TB + 30 x 3 TB
|December 28, 2013||December 28, 2013||Source||Shigeru Kondo||Pi||12,100,000,000,050||2 x Xeon E5-2690 @ 2.9 GHz
128 GB DDR3 @ 1600 MHz
24 x 3 TB
See the complete list including other notably large computations.
If you wish to set a record, you must run two computations using different formulas (one to compute, the other to verify). Then send me the validation files, but do not make any attempt to modify them. The validation files are protected with a checksum to prevent tampering/cheating. Yes, people have tried to cheat before.
An exception to the "two computations rule" can be made for Pi since it can be verified using BBP formulas.
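To give a feel for why a BBP formula makes this possible: it lets you compute a hexadecimal digit of Pi at a chosen position without computing the digits before it, so digits of a large computation can be spot-checked independently. The sketch below is a textbook double-precision version of the Bailey-Borwein-Plouffe digit extraction. It is not how y-cruncher implements its verification, the function names are made up for this example, and plain floating point limits it to modest positions.

```python
def _bbp_series(j: int, n: int) -> float:
    """Fractional part of sum over k of 16^(n-k) / (8k + j)."""
    s = 0.0
    # Terms with k <= n: modular exponentiation keeps the numbers small.
    for k in range(n + 1):
        r = 8 * k + j
        s = (s + pow(16, n - k, r) / r) % 1.0
    # Terms with k > n shrink quickly; stop once they are negligible.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        s += term
        k += 1
    return s % 1.0

def pi_hex_digit(n: int) -> str:
    """Hex digit of Pi at position n after the point (n = 0 is the first)."""
    x = (4 * _bbp_series(1, n) - 2 * _bbp_series(4, n)
         - _bbp_series(5, n) - _bbp_series(6, n)) % 1.0
    return "0123456789abcdef"[int(x * 16)]

print("".join(pi_hex_digit(i) for i in range(10)))
```

Pi in hexadecimal begins 3.243f6a8885..., so the first ten extracted digits can be compared against that.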
Note that for anyone attempting to set a Pi world record: Should the attempt succeed, I kindly ask that you make yourself sufficiently available for external requests to access or download the digits in their entirety (at least until it is broken again by someone else). Pi is popular enough that people do actually want to see the digits.
Aside from computing Pi and other constants, y-cruncher is great for stress testing 64-bit systems with lots of RAM.
Sample Screenshot: 100 billion digits of Pi
|Core i7 5960X @ 4.0 GHz - 128GB DDR4 @ 2666 MHz - 16 HDs|
Latest Releases: (September 14, 2017)
OS Programs Download Link Size
y-cruncher + HWBOT Submitter
HWBOT Submitter Only
The Linux version comes in both statically and dynamically linked versions. The static version should work on most Linux distributions, but lacks Cilk Plus and NUMA binding. The dynamic version supports all features, but is less portable due to the DLL dependency hell.
The HWBOT submitter allows y-cruncher benchmarks to be submitted to HWBOT - which is a competitive overclocking site. It is currently only available for Windows.
- Windows Vista or later.
- The HWBOT submitter requires the Java 8 Runtime.
- 64-bit Linux is required. There is no support for 32-bit.
- The dynamic version has been tested on Ubuntu 17.04.
- An x86 or x64 processor.
Very old systems that don't meet these requirements may be able to run older versions of y-cruncher. Support goes all the way back to even before Windows XP.
Other Downloads (for C++ programmers):
|1 Billion digits of Pi (times in seconds)|
|Intel Core i7 4770k||4/8||4.0 GHz||Windows 10||477.280||111.295||4.29x|
|AMD FX-8350||8/8||4.0 GHz||Windows 10||1215.302||243.294||5.00x|
|Intel Core i7 5960X||8/16||4.0 GHz||Windows 7||483.574||57.867||8.36x|
So while it may be difficult to believe, Windows is currently the more suitable OS for running y-cruncher.
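The last column of the table above can be reproduced from the two timing columns. This sketch assumes (the column headers are not shown) that the first time is a single-threaded run, the second is a multi-threaded run, and that "4/8" style entries mean cores/threads. The per-core figure is an extra added here for illustration.

```python
# (processor, cores, threads, assumed single-threaded s, assumed multi-threaded s)
runs = [
    ("Intel Core i7 4770k", 4, 8, 477.280, 111.295),
    ("AMD FX-8350", 8, 8, 1215.302, 243.294),
    ("Intel Core i7 5960X", 8, 16, 483.574, 57.867),
]

def speedup(single_s: float, multi_s: float) -> float:
    """Ratio of the slower (single-threaded) time to the faster one."""
    return single_s / multi_s

for name, cores, threads, single_s, multi_s in runs:
    s = speedup(single_s, multi_s)
    print(f"{name}: {s:.2f}x speedup, {s / cores:.2f}x per core")
```

Under these assumptions, a per-core figure above 1.00x would reflect the extra throughput from running two threads per core.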
Comparison Chart: (Last updated: August 13, 2017)
Computations of Pi to various sizes. All times in seconds. All computations done entirely in RAM.
The timings include the time needed to convert the digits to decimal representation, but not the time needed to write out the digits to disk.
Laptops + Low-Power:
|Processor(s):||Core i7 3630QM||VIA C4650¹||Xeon E3-1535M v5²||Core i7 6820HK||Pentium N4200¹|
|Generation:||Intel Ivy Bridge||VIA Isaiah||Intel Skylake||Intel Skylake||Intel Apollo Lake|
|Processor Speed:||3.2 GHz||2.0 GHz||2.9 GHz||3.2 GHz||1.1 - 2.5 GHz|
|Memory:||8 GB - 1600 MHz||16 GB||16 GB||48 GB - 2133 MHz||4 GB|
|Version:||v0.7.2 - AVX||v0.7.2 - AVX||v0.7.1 - ADX||v0.7.2 - ADX||v0.7.2 - SSE4.1|
¹ Credit to Tralalak.
² Credit to Kaupo Karuse.
|Processor(s):||Core 2 Quad Q6600||Core i7 920||FX-8350||Core i7 4770K||Core i7 5775C¹||Core i7 7700K²||Ryzen 7 1800X|
|Generation:||Intel Core||Intel Nehalem||AMD Piledriver||Intel Haswell||Intel Broadwell||Intel Kaby Lake||AMD Zen|
|Processor Speed:||2.4 GHz||3.5 GHz (OC)||4.0 GHz||4.0 GHz (OC)||3.8 GHz (OC)||4.8 GHz (OC)||3.7 GHz|
|Memory:||6 GB - 800 MHz||12 GB - 1333 MHz||32 GB - 1333 MHz||32 GB - 2133 MHz||16 GB - 2400 MHz||64 GB - 3000 MHz||64 GB - 2133 MHz|
|Version:||v0.7.2 - SSE3||v0.7.2 - SSE4.1||v0.7.2 - XOP||v0.7.2 - AVX2||v0.7.1 - ADX||v0.7.1 - ADX||v0.7.2 - ADX|
¹ Credit to André Bachmann.
² Credit to Oliver Kruse.
|Processor(s):||Core i7 5820K¹||Core i7 5960X||Threadripper 1950X²||Core i9 7900X|
|Generation:||Intel Haswell||Intel Haswell||AMD Threadripper||Intel Skylake Purley|
|Processor Speed:||4.5 GHz (OC)||4.0 GHz (OC)||4.0 GHz (OC)||
|2.4 GHz cache||3.0 GHz cache|
|Memory:||32 GB - 2400 MHz||128 GB - 2666 MHz||64 GB - 2800-3200 MHz||128 GB - 3200 MHz||128 GB - 3400 MHz|
|Version:||v0.7.3 - AVX2||v0.7.2 - AVX2||v0.7.3 - ADX||v0.7.3 - AVX512-DQ|
¹ Credit to Sean Heneghan.
² Credit to Oliver Kruse.
Due to high core counts and the effect of NUMA (Non-Uniform Memory Access), performance on multi-processor systems is extremely sensitive to various settings. Therefore, these benchmarks may not be entirely representative of what the hardware is capable of.
For example, enabling node-interleaving in the BIOS can improve performance by around 2x. But tweaks like these are often not possible, as many of these systems are corporate or university machines that are heavily locked down and do not provide the user with sufficient access privileges. Furthermore, due to the exponentially large space of settings and configurations, it's often difficult to find the optimal set of settings.
|Processor(s):||Xeon X5482||Xeon E5-2690¹||Xeon E5-2683 v3¹||Xeon E5-2630 v4²||Xeon E5-2696 v4³||Xeon E7-8880 v3⁴||Epyc 7601⁵||Xeon Gold 6130F⁵|
|Generation:||Intel Penryn||Intel Sandy Bridge||Intel Haswell||Intel Broadwell||Intel Broadwell||Intel Haswell||AMD Naples||Intel Skylake Purley|
|Processor Speed:||3.2 GHz||3.5 GHz||2.03 GHz||2.2 GHz||2.2 GHz||2.3 GHz||2.2 GHz||2.1 GHz|
|Memory:||64 GB - 800 MHz||256 GB - ???||128 GB - ???||64 GB - 2133 MHz||768 GB - ???||2 TB - ???||256 GB - ??||256 GB - ??|
|Version:||v0.7.2 - SSE4.1||v0.6.2/3 - AVX||v0.6.9 - AVX2||v0.7.3 - ADX||v0.7.1 - ADX||v0.7.1 - AVX2||v0.7.3 - ADX||v0.7.3 - AVX512-DQ|
¹ Credit to Shigeru Kondo.
² Credit to Cameron Giesbrecht.
³ Credit to "yoyo".
⁴ Credit to Jacob Coleman.
⁵ Credit to Dave Graham.
I've been asked a few times what benchmarks qualify for these tables. But there aren't any specific rules. For the most part, I try to maximize the variety of processors on the list. So I won't put more than one system in each processor line unless they have drastically different capabilities such as core count. I also have a strong preference for systems that are at the top of their line and have as much memory as possible.
Perhaps the most important part is that the benchmarks are representative of the hardware. If there is any evidence of interference that may cause the hardware to perform suboptimally, they will be excluded. Examples of this include (but are not limited to) underclocking, disabled cores, disabled hyperthreading, disabled AVX, fewer than all memory channels, background programs, thermal throttling, using an outdated version of y-cruncher, etc. Some leeway is given to multi-processor servers since they are so sensitive to numerous factors.
Likewise, absurdly high overclocks will be excluded. These tables are meant to compare systems running at real life speeds. Benchmarks done with extreme overclocks (especially with liquid nitrogen) should go on HWBOT. Just be aware that HWBOT has stringent rules on submissions since it's competitive.
The full chart of rankings for each size can be found here:
These fastest times may include unreleased betas.
Got a faster time? Let me know: firstname.lastname@example.org
Note that I usually don't respond to these emails. I simply put them into the charts which I update periodically.
Pi and other Constants:
Hardware and Overclocking:
Here are some interesting sites dedicated to the computation of Pi and other constants:
Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.