y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

(Last updated: April 3, 2024)

Shortcuts:

Numberworld Home

Records Set by y-cruncher
Download y-cruncher
Features
Benchmarks
Fastest Times
Performance Tips
Known Issues
Algorithms
FAQ
Links
Contact Me

The first scalable multi-threaded Pi-benchmark for multi-core systems...

How fast can your computer compute Pi?

y-cruncher is a program that can compute Pi and other constants to trillions of digits.

It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.

y-cruncher has been used to set several world records for the most digits of Pi ever computed.

Current Release:

Windows: Version 0.8.4 Build 9538 (Released: February 22, 2024)

Linux : Version 0.8.4 Build 9538 (Released: February 22, 2024)

Official Mersenneforum Subforum.

Official HWBOT forum thread.

News:

Countering the Compiler Regression with Optimizations: (April 3, 2024)

These kinds of topics are hard to write about since it's not all positive. But let's start with a table because everyone hates walls of text:

1 Billion Digits of Pi (times in seconds)
Processor	Architecture	Clock Speeds	Binary	ISA	v0.8.4	v0.8.5 (ICC)	v0.8.5 (ICX)	v0.8.4 -> v0.8.5	ICC -> ICX	Overall
Core i7 920	Intel Nehalem	3.5 GHz + 3 x 1333 MT/s	08-NHM	x64 SSE4.1	535.818	492.971	482.982	+8.00%	+2.03%	+9.86%
Core i7 3630QM	Intel Ivy Bridge	stock + 2 x 1600 MT/s	11-SNB	x64 AVX	339.96	318.037	305.360	+6.45%	+3.99%	+10.18%
FX-8350	AMD Piledriver	stock + 2 x 1600 MT/s	12-BD2	x64 FMA3	225.749	218.338	216.159	+3.28%	+1.00%	+4.25%
Core i7 5960X	Intel Haswell	4.0 GHz + 4 x 2400 MT/s	13-HSW	x64 AVX2	49.441	48.568	50.205	+1.77%	-3.37%	-1.55%
Core i7 6820HK	Intel Skylake	stock + 2 x 2133 MT/s	14-BDW	x64 AVX2 + ADX	102.144	100.887	103.570	+1.23%	-2.66%	-1.40%
Ryzen 7 1800X	AMD Zen 1	stock + 2 x 2866 MT/s	17-ZN1	x64 AVX2 + ADX	77.505	75.965	76.800	+1.99%	-1.10%	+0.91%
Core i9 7940X	Intel Skylake X	3.6 GHz (AVX512) + 4 x 3466 MT/s	17-SKX	x64 AVX512-DQ	20.686	19.912	20.428	+3.74%	-2.59%	+1.25%
Ryzen 9 3950X	AMD Zen 2	stock + 2 x 2666 MT/s	19-ZN2	x64 AVX2 + ADX	34.814	33.292	33.161	+4.37%	+0.39%	+4.75%
Core i7 11800H	Intel Tiger Lake	stock + 2 x 3200 MT/s	18-CNL	x64 AVX512-VBMI	35.739	34.438	35.052	+3.64%	-1.78%	+1.92%
Ryzen 9 7950X	AMD Zen 4	stock + 2 x 5000 MT/s	22-ZN4	x64 AVX512-GFNI	18.978	18.848	18.937	+0.69%	-0.47%	+0.22%

ICC is Intel's old Classic C++ Compiler, and ICX is Intel's new LLVM-based compiler. So we can see that:

y-cruncher v0.8.5 will have new software optimizations that improve performance on all processors.
Intel's new compiler (ICX) is worse than their old compiler (ICC) on nearly all modern processors.

In short, Intel's new compiler is causing a performance regression in y-cruncher. In order to prevent the next version of y-cruncher from actually getting slower, I am trying to offset the regressions with new performance optimizations - with only partial success so far.
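For readers who want to check the percentage columns: the figures in the table are consistent with a relative reduction in run time, (old - new) / old, rather than a ratio of speeds. That formula is inferred from the numbers and is not stated in the text, and the function name below is only illustrative. A minimal sketch under that assumption:

```cpp
#include <cassert>

// Relative reduction in run time, in percent (formula inferred from the table).
// Positive: the new build is faster. Negative: the new build is slower.
double percent_time_saved(double old_seconds, double new_seconds){
    assert(old_seconds > 0.0);
    return (old_seconds - new_seconds) / old_seconds * 100.0;
}
```

Using the Core i7 920 row, percent_time_saved(535.818, 492.971) comes to about 8.00, percent_time_saved(492.971, 482.982) to about 2.03, and percent_time_saved(535.818, 482.982) to about 9.86, which matches the three percentage columns for that row.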

But how did we get here?

Intel's classic C++ compiler has historically been the best compiler for code performance. However, starting from about 2020, Intel began migrating to a new LLVM-based compiler (ICX) which they wrapped up last year by discontinuing their old compiler (ICC). The problem is that for y-cruncher at least, ICX isn't actually better than ICC.

BBP - 10 billionth Hex Digit of Pi (times in seconds)
Processor	Architecture	Clock Speeds	Binary	ISA	MSVC 17.7.1	ICC 19.2	ICX 2024	ICC -> ICX
Core i7 920	Intel Nehalem	3.5 GHz	08-NHM	x64 SSE4.1	568.384	574.745	725.910	-26.30%
Core i7 3630QM	Intel Ivy Bridge	stock	11-SNB	x64 AVX	525.811	436.337	464.628	-6.48%
FX-8350	AMD Piledriver	stock	12-BD2	x64 FMA3	251.695	231.205	235.828	-2.00%
Core i7 5960X	Intel Haswell	4.0 GHz	13-HSW	x64 AVX2	55.249	50.640	53.422	-5.49%
Core i7 6820HK	Intel Skylake	stock	14-BDW	x64 AVX2	107.977	105.307	108.959	-3.47%
Ryzen 7 1800X	AMD Zen 1	stock	17-ZN1	x64 AVX2	97.809	97.915	95.269	+2.70%
Core i9 7940X	Intel Skylake X	3.6 GHz (AVX512)	17-SKX	x64 AVX512-DQ	13.518	13.561	15.340	-13.12%
Ryzen 9 3950X	AMD Zen 2	stock	19-ZN2	x64 AVX2	22.506	21.043	20.982	+0.29%
Core i7 11800H	Intel Tiger Lake	stock	18-CNL	x64 AVX512-DQ	50.002	50.798	51.654	-1.69%
Ryzen 9 7950X	AMD Zen 4	stock	22-ZN4	x64 AVX512-DQ	11.521	11.424	12.232	-7.07%

In other words, Intel got rid of their old compiler while their new compiler has yet to match it in performance. And because of the need to stay up-to-date with C++ features and CPU instruction sets, I cannot stay on an old compiler forever. Thus an "upgrade" is inevitable even if that hurts performance.

What about other compilers? If Intel's new compiler is bad, what about other alternatives? Well...

Microsoft's compiler is overall worse than both of Intel's compilers.
GCC is unusable on Windows due to this 12-year-old bug which shows no sign of ever being fixed.
Clang I have not tested, but it is also LLVM-based like ICX. So I would be surprised if it was any better than ICX.

So even though Intel has made their compiler worse, it's still better than its competitors.

So why is Intel's new compiler worse than their old compiler?

There is no single regression in Intel's LLVM compiler that is responsible for the entire regression vs. their classic compiler. It's a combination of many regressions (and improvements) that collectively add up, with the regressions winning in the end by several percent. Anecdotally speaking, small regressions tended to involve inferior instruction selection and ordering, while larger regressions tended to fall into these categories:

Poor register allocation. Code that has too many live values to fit in registers tended to spill more on ICX than ICC. While this has improved in ICX 2024, ICC remains superior here.

ICX fails to align the stack for vector register spills. For large functions with many vector register spills, ICX does not align the spilled values. The result is misalignment penalties for register spills, which are further aggravated by poor register allocation. The only way to align the stack is a compiler option that vendor-locks the code to Intel processors: if you turn it on, the binary will not run on AMD processors.

Loop-invariant Code Motion (LICM) and Common Subexpression Elimination (CSE) often combine to blow up the stack and code size.

ICX is overly aggressive with loop optimizations that cannot be turned off without completely disabling optimizations.

This last category is particularly nasty. Complicated loop optimizations like loop interchange, loop fusion/fission, loop materialization, and aggressive loop unrolling are only turned on at the maximum optimization level (O3) for most compilers due to their high risk of backfiring. However, I've observed that most of these are already enabled at O1 and O2 for ICX and are difficult or impossible to disable. And when such optimizations backfire, they can easily cut the performance of the loop by a factor of 3 or more.

Below are some pseudo-code examples illustrating the major ways that I have observed ICX loop optimizations to backfire. Actual code that experiences such behavior is generally much larger and more complicated. Self-contained samples have been provided to Intel's engineers in the hope that they can improve their compiler.

Example 1: Loop Fusion Gone Bad

double* A = ...;

for (size_t c = 0; c < length; c++){
    double tmp = A[c];
    // Long dependency chain.
    A[c] = tmp;
}

for (size_t c = 0; c < length; c++){
    double tmp = A[c];
    // Long dependency chain.
    A[c] = tmp;
}

In this example, the iterations of each loop are all independent and can be run in parallel. But within each iteration is a long dependency chain. In order to keep the iterations within the CPU reorder window, the work is intentionally split into multiple loops (more than just 2 loops as shown here). This allows the CPU to reorder across iterations - thus allowing instruction level parallelism.

However, ICX doesn't always allow this to happen. Instead, it sometimes decides to undo my hand optimization by fusing the loops back together into this:

for (size_t c = 0; c < length; c++){
    double tmp = A[c];
    // Super long dependency chain.
    A[c] = tmp;
}

This improves memory locality by traversing the array only once instead of twice, but it lengthens the dependency chain to the point that the CPU is no longer able to sufficiently reorder across iterations. Thus it kills instruction-level parallelism (ILP) and hurts performance. The compiler may be incorrectly assuming that the dataset does not fit in cache when in fact it does.

The same situation can happen with loop-interchange where ICX will interchange loops to improve memory locality at the cost of creating dependency chains that wipe out instruction level parallelism.
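For anyone who wants to experiment with the pattern from Example 1, here is a minimal, self-contained sketch. The two "chains" are made-up stand-ins (sixteen dependent multiply-adds each), not y-cruncher kernels, and all function names are hypothetical. It only shows that the split and fused forms compute the same values; whether a given compiler fuses the split form, and what that does to run time, has to be checked in the generated assembly and with a timer.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-ins for a "long dependency chain": every step needs the previous result.
inline double chain_a(double x){
    for (int i = 0; i < 16; i++) x = x * 0.5 + 1.0;
    return x;
}
inline double chain_b(double x){
    for (int i = 0; i < 16; i++) x = x * 0.25 + 2.0;
    return x;
}

// Hand-split form: one short chain per iteration, two passes over the array.
void split_passes(std::vector<double>& A){
    for (std::size_t c = 0; c < A.size(); c++) A[c] = chain_a(A[c]);
    for (std::size_t c = 0; c < A.size(); c++) A[c] = chain_b(A[c]);
}

// Fused form: one pass, but each iteration now carries both chains back to back.
void fused_pass(std::vector<double>& A){
    for (std::size_t c = 0; c < A.size(); c++) A[c] = chain_b(chain_a(A[c]));
}

// Sanity check: both forms must agree on every element.
void check_equivalent(std::vector<double> data){
    std::vector<double> other = data;
    split_passes(data);
    fused_pass(other);
    for (std::size_t c = 0; c < data.size(); c++){
        assert(std::fabs(data[c] - other[c]) <= 1e-12);
    }
}
```

The point of keeping both forms side by side is that the transformation is legal (the results match), so the only question is which loop structure the hardware executes faster.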

Example 2: Everything Blows Up

This example is a pathologically bad case where Loop-invariant Code Motion (LICM) and loop unrolling combine to create a perfect storm that simultaneously blows up instruction cache, data cache, and performance. While it looks rather specific, it is nevertheless a common pattern in y-cruncher.

Here the code is iterating an array of AVX512 vectors that uses 1000 scalar weights. Each time a scalar weight is used, it is broadcast to a full vector to operate on the array A. In AVX512, a scalar broadcast has the same cost as a full vector load. So there is no added cost of redoing the broadcast in the inner loop.

const double* weights = ...;
__m512d* A = ...;

for (size_t c = 0; c < length; c++){
    __m512d tmp = A[c];
    for (size_t w = 0; w < 1000; w++){
        __m512d weight = _mm512_set1_pd(weights[w]);  // Scalar Broadcast. Same cost as regular load.
        // Do something with "tmp" and "weight".
    }
    A[c] = tmp;
}

Instead, ICX has a tendency to turn it into the following:

const double* weights = ...;
__m512d* A = ...;

__m512d expanded_weights0 = _mm512_set1_pd(weights[0]);  // Each of these is 64 bytes!
__m512d expanded_weights1 = _mm512_set1_pd(weights[1]);
__m512d expanded_weights2 = _mm512_set1_pd(weights[2]);
...
__m512d expanded_weights999 = _mm512_set1_pd(weights[999]);

for (size_t c = 0; c < length; c++){
    __m512d tmp = A[c];
    // Do something with "tmp" and "expanded_weights0".
    // Do something with "tmp" and "expanded_weights1".
    // Do something with "tmp" and "expanded_weights2".
    // ...
    // Do something with "tmp" and "expanded_weights999".
    A[c] = tmp;
}

What was supposed to be a bunch of (free) scalar broadcasts has turned into 64 KB of stack usage and two fully unrolled 1000-iteration loops - one of which is completely useless. In this example, this transformation is never beneficial as broadcast loads are already free to begin with. So replacing them with stack spills and trashing both the data and instruction caches only makes things worse. For small values of length, this transformation is devastating to performance due to the initial setup.

So what happened?

The compiler first sees that the inner loop has a compile-time trip count. So it decides it can completely unroll it. I have never seen compilers completely unroll loops this large, but ICX apparently does it with several of y-cruncher's kernels.
The compiler deduces that weights does not alias with A. Thus it sees that the loads and scalar broadcasts are loop invariant. So it pulls them out of the loop. Yes, all 1000 of them.
Those 1000 values need to go somewhere, right? So it spills them onto the stack (also incurring any penalties due to stack misalignment).
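The shape of this transformation can be reproduced without AVX512 hardware. The sketch below is a portable emulation, not the real kernel: Vec8 stands in for __m512d, broadcast() for _mm512_set1_pd(), and step() is a made-up placeholder for "do something with tmp and weight". All of these names are assumptions for illustration. It also works out the size of the hoisted table: 1000 vectors of 64 bytes each is 64,000 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Portable stand-in for __m512d: 8 doubles = 64 bytes.
struct Vec8 { double lane[8]; };
static_assert(sizeof(Vec8) == 64, "one AVX512 vector is 64 bytes");

constexpr std::size_t kWeights = 1000;
// Size of the table that the hoisted form has to keep live.
constexpr std::size_t kExpandedBytes = kWeights * sizeof(Vec8);
static_assert(kExpandedBytes == 64000, "roughly 64 KB");

// Stand-in for _mm512_set1_pd(): copy one scalar into every lane.
inline Vec8 broadcast(double x){
    Vec8 v;
    for (double& l : v.lane) l = x;
    return v;
}

// Hypothetical placeholder for the real per-weight work.
inline void step(Vec8& tmp, const Vec8& weight){
    for (int i = 0; i < 8; i++) tmp.lane[i] = tmp.lane[i] * 0.5 + weight.lane[i];
}

// As written: the broadcast stays in the inner loop, nothing extra is kept live.
void as_written(std::vector<Vec8>& A, const double* weights){
    assert(weights != nullptr);
    for (Vec8& a : A){
        Vec8 tmp = a;
        for (std::size_t w = 0; w < kWeights; w++) step(tmp, broadcast(weights[w]));
        a = tmp;
    }
}

// Emulation of the transformed code: every broadcast is hoisted into a table first.
// The table lives on the heap here; the compiler's version spills to the stack.
void hoisted(std::vector<Vec8>& A, const double* weights){
    assert(weights != nullptr);
    std::vector<Vec8> expanded(kWeights);
    for (std::size_t w = 0; w < kWeights; w++) expanded[w] = broadcast(weights[w]);
    for (Vec8& a : A){
        Vec8 tmp = a;
        for (std::size_t w = 0; w < kWeights; w++) step(tmp, expanded[w]);
        a = tmp;
    }
}
```

Both functions produce the same output. The hoisted form pays for filling the 64,000-byte table before any element of A is touched, which is why it hurts most when length is small.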

To put it simply, other compilers do not do this kind of stuff. Or at least they have limits to prevent this from happening. ICX appears to be completely unrestrained.

A common theme among ICX misoptimizations is that Loop Invariant Code Motion (LICM) and Common Subexpression Elimination (CSE) will create additional live values that end up getting spilled to the stack, thus invoking a penalty that is often larger than the initial savings. The example above is a cherry-picked example where ICX takes this concept to the extreme resulting in an avalanche of secondary regressions such as misalignment penalties and cache pollution.

Conclusion:

Intel's LLVM compiler is undoubtedly a very powerful compiler. And the more I study it, the more I am impressed with its ability. However, with power comes responsibility, and unfortunately I cannot say that ICX wields this power well. I have yet to investigate if these issues are in LLVM itself or in Intel's modifications to it. But regardless, as of today, Intel's LLVM compiler can be best described as a child running with scissors - young and reckless with dangerous tools.

How long will it take for ICX to reach ICC's quality of code generation? I have no idea. And after waiting more than a year for this to happen, I've decided that it's probably not going to happen for a very long time. For everything that ICX screws up, it probably gets 5 others right. But for code that has already been hand-optimized, getting it right is neutral while getting it wrong hurts a lot. Dropping down to assembly is not an option because there are "thousands" of distinct kernels which are largely generated via template metaprogramming.

Is y-cruncher the only application affected like this? Probably not.

Older News

Records Set by y-cruncher:

y-cruncher has been used to set a number of world-record-sized computations.

Blue: Current World Record

Green: Former World Record

Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.

Date Announced	Date Completed	Source	Who	Constant	Decimal Digits	Time	Computer
March 14, 2024	February 27, 2024	Source	Jordan Ranous, Kevin O’Brien, Brian Beeler (StorageReview)	Pi	105,000,000,000,000	Compute: 75 days Verify: 4 days Validation File	2 x AMD Epyc 9754 1.5 TB 960 TB storage
February 13, 2024	February 12, 2024		Jordan Ranous	Log(2)	3,000,000,000,000	Compute: 42.7 hours Verify: 58.3 hours	2 x Intel Xeon Platinum 8460H 512 GB
January 17, 2024	January 10, 2023		Mamdouh Barakat	Zeta(5)	250,000,000,000	Compute: 6.02 days *Not Verified*	Intel Xeon Gold 5412U 125 GB
January 17, 2024	December 12, 2023		Jordan Ranous	Gamma(1/4)	1,000,000,000,000	Compute: 22.6 hours Verify: 22.8 hours	2 x Intel Xeon Platinum 8450H 512 GB
December 26, 2023	December 24, 2023		Jordan Ranous	e	35,000,000,000,000	Compute: 94.5 hours Verify: 92.5 hours	2 x Intel Xeon Platinum 8460H 512 GB
December 26, 2023	December 25, 2023		Jordan Ranous	Square Root of 2	20,000,000,000,000	Compute: 29.2 hours Verify: 21.6 hours	Intel Xeon Platinum 8450H 512 GB Intel Xeon Platinum 8460H 512 GB
December 26, 2023	December 22, 2023		Andrew Sun	Zeta(3) - Apery's Constant	2,020,569,031,595	Compute: 5.61 days Verify: 5.93 days	Intel Xeon Platinum 8347C 505 GB Intel Xeon Platinum 8347C 507 GB
December 18, 2023	December 15, 2023		Jordan Ranous	Gamma(1/3)	1,000,000,000,000	Compute: 17.5 hours Verify: 23.3 hours	2 x Intel Xeon Platinum 8450H 512 GB
December 18, 2023	December 11, 2023		Jordan Ranous	Zeta(5)	201,000,001,300	Compute: 29.9 hours Verify: 23.5 hours	2 x AMD EPYC 9754 1.5 TB
December 2, 2023	November 27, 2023		Jordan Ranous	Golden Ratio	20,000,000,000,000	Compute: 76.1 hours Verify: 30.0 hours	AMD Epyc 9654 - 1.5 TB Intel Xeon Platinum 8450H
September 9, 2023	September 7, 2023		Andrew Sun	Euler-Mascheroni Constant	1,337,000,000,000	Compute: 28.5 days Verify: 41.3 days	Intel Xeon Platinum 8347C 400 GB
July 17, 2022	July 15, 2022		Seungmin Kim	Lemniscate	1,200,000,000,100	Compute: 32.2 days Verify: 46.5 days	2 x Intel Xeon Gold 6140 377 GB
June 8, 2022	March 21, 2022		Emma Haruka Iwao	Pi	100,000,000,000,000	Compute: 158 days Verify: 12.6 hours Validation File	128 vCPU Intel Ice Lake (GCP) 864 GB 663 TB storage
March 14, 2022	March 9, 2022		Seungmin Kim	Catalan's Constant	1,200,000,000,100	Compute: 48.6 days Verify: 47.3 days	2 x Intel Xeon Gold 6140 2 x Intel Xeon E5-2680 v3
August 17, 2021	August 14, 2021	Source	UAS Grisons	Pi	62,831,853,071,796	Compute: 108 days Verify: 34.4 hours	AMD Epyc 7542 1 TB 34 + 4 Hard Drives
September 13, 2020	September 6, 2020		Seungmin Kim	Log(10)	1,200,000,000,100	Compute: 14.5 days Verify: 22.5 days	2 x Intel Xeon E5-2699 v3 756 GB 2 x Intel Xeon Gold 5220 754 GB
January 29, 2020	January 29, 2020	Blog	Timothy Mullican	Pi	50,000,000,000,000	Compute: 303 days Verify: 17.2 hours Validation File	4 x Intel Xeon E7-4880 v2 315 GB 48 Hard Drives
March 14, 2019	January 21, 2019	Blogs 1 + 2	Emma Haruka Iwao	Pi	31,415,926,535,897	Compute: 121 days Verify: 20.0 hours Validation File	2 x Undisclosed Intel Xeon > 1.40 TB DDR4 > 240 TB SSD
November 15, 2016	November 11, 2016	Blog Sponsor	Peter Trueb	Pi	22,459,157,718,361	Compute: 105 days Verify: 28 hours Validation File	4 x Xeon E7-8890 v3 1.25 TB DDR4 20 x 6 TB 7200 RPM Seagate
October 8, 2014	October 7, 2014		Sandon Van Ness (houkouonchi)	Pi	13,300,000,000,000	Compute: 208 days Verify: 182 hours Validation File	2 x Xeon E5-4650L 192 GB DDR3 @ 1333 MHz 24 x 4 TB + 30 x 3 TB
December 28, 2013	December 28, 2013	Source	Shigeru Kondo	Pi	12,100,000,000,050	Compute: 94 days Verify: 46 hours	2 x Xeon E5-2690 128 GB DDR3 @ 1600 MHz 24 x 3 TB

See the complete list including other notably large computations. If you want to set a record yourself, the rules are in that link.

Features:

The main computational features of y-cruncher are:

Able to compute Pi and other constants to trillions of digits.
Two algorithms are available for most constants. One for computation and one for verification.
Multi-Threaded - Multi-threading can be used to fully utilize modern multi-core processors without significantly increasing memory usage.
Vectorized - Able to fully utilize the SIMD capabilities for most processors. (SSE, AVX, AVX512, etc...)
Swap Space management for large computations that require more memory than is available.
Multi-Hard Drive - Multiple hard drives can be used for faster disk swapping.
Semi-Fault Tolerant - Able to detect and correct for minor errors that may be caused by hardware instability or software bugs.

Download:

Sample Screenshot: 1 trillion digits of Pi

Core i7 5960X @ 4.0 GHz - 64 GB DDR4 @ 2400 MHz - 16 HDs

Latest Releases: (February 22, 2024)

Downloading any of these files constitutes acceptance of the license agreement.

OS	Download Link	Size
Windows	y-cruncher v0.8.4.9538a.zip	35.0 MB
Linux (Static)	y-cruncher v0.8.4.9538-static.tar.xz	26.7 MB
Linux (Dynamic)	y-cruncher v0.8.4.9538-dynamic.tar.xz	19.0 MB

Downloads can also be found on GitHub. Use this if you prefer HTTPS.

The Linux version comes in both statically and dynamically linked versions. The static version should work on most Linux distributions, but lacks TBB and NUMA binding. The dynamic version supports all features, but is less portable due to the DLL dependency hell.

HWBOT submission is back with this release. So I expect the leaderboards to be rewritten soon.

System Requirements:

Windows:

Windows 7 or later.

The HWBOT submitter requires the Java 8 Runtime.

Linux:

64-bit Linux is required. There is no support for 32-bit.

The dynamic version has been tested on Ubuntu 22.04.

All Systems:

An x86 or x64 processor.

Very old systems that don't meet these requirements may be able to run older versions of y-cruncher. Support goes all the way back to even before Windows XP.

Version History:

Other Downloads (for C++ programmers):

Advanced Documentation:

Benchmarks:

Comparison Chart: (Last updated: July 11, 2023)

Computations of Pi to various sizes. All times in seconds. All computations done entirely in RAM.

The timings include the time needed to convert the digits to decimal representation, but not the time needed to write out the digits to disk.

Blue: Benchmarks are up-to-date with the latest version of y-cruncher.

Green: Benchmarks were done with an old version of y-cruncher that is comparable in performance with the current release.

Red: Benchmarks are significantly out-of-date due to being run with an old version of y-cruncher that is no longer comparable with the current release.

Purple: Benchmarks are from unreleased internal builds that are not speed comparable with the current release.

Laptops + Low-Power:

Processor(s):	Core i7 6820HK	Core i7 11800H	Core i7 11800H
Generation:	Intel Skylake	Intel Tiger Lake	Intel Tiger Lake
Cores/Threads:	4/8	8/16	8/16
Processor Speed:	3.2 GHz (stock)	~2.5 GHz (45W PL)	~3.0 GHz (60W PL)
Memory:	64 GB @ 2133 MT/s	64 GB @ 3200 MT/s	64 GB @ 3200 MT/s
Version:	v0.8.1 (14-BDW)	v0.8.1 (18-CNL)	v0.8.1 (18-CNL)
Instruction Set:	x64 AVX2 + ADX	x64 AVX512-VBMI	x64 AVX512-VBMI
25,000,000	1.500	0.655	0.530
50,000,000	3.307	1.406	1.125
100,000,000	7.238	3.005	2.447
250,000,000	20.596	8.576	6.855
500,000,000	45.967	19.747	15.356
1,000,000,000	102.885	42.727	34.308
2,500,000,000	290.824	123.523	96.918
5,000,000,000	640.506	247.705	218.782
10,000,000,000	1,391.204	526.212	480.197
Credit:

Processor(s):	Core i3 8121U			Core i7 11800H
Generation:	Intel Cannon Lake			Intel Tiger Lake
Cores/Threads:	2/4			8/16
Processor Speed:	~2.5 - 3.2 GHz (stock)			~2.5 - 2.8 GHz (45W PL)
Memory:	8 GB @ 2400 MT/s			64 GB @ 3200 MT/s
Version:	v0.8.1 (14-BDW)	v0.8.1 (17-SKX)	v0.8.1 (18-CNL)	v0.8.1 (14-BDW)	v0.8.1 (17-SKX)	v0.8.1 (18-CNL)
Instruction Set:	x64 AVX2 + ADX	x64 AVX512-DQ	x64 AVX512-VBMI	x64 AVX2 + ADX	x64 AVX512-DQ	x64 AVX512-VBMI
25,000,000	2.857	2.467	1.988	0.907	0.853	0.655
50,000,000	6.446	5.501	4.392	2.075	1.862	1.406
100,000,000	14.335	12.257	9.490	4.176	3.749	3.005
250,000,000	42.566	36.204	27.137	12.014	10.705	8.576
500,000,000	99.040	85.443	64.359	28.805	24.123	19.747
1,000,000,000	228.863	198.405	151.605	63.898	55.264	42.727
2,500,000,000				187.882	148.423	123.523
5,000,000,000				375.130	327.776	247.705
10,000,000,000				794.573	709.606	526.212
Credit:

Mainstream Desktops:

Processor(s):	Ryzen 5 7600	Core i7 11700K	Ryzen 9 3950X	Ryzen 9 5950X	Intel Core i9 13900KS	Ryzen 9 7950X
Generation:	AMD Zen 4	Intel Rocket Lake	AMD Zen 2	AMD Zen 3	Intel Raptor Lake	AMD Zen 4
Cores/Threads:	6/12	8/16	16/32	16/32	24/32	16/32
Processor Speed:		stock	stock	stock	5.7/4.5 GHz	stock
Memory:	32 GB	32 GB - 3200 MT/s	128 GB - 2666 MT/s	64 GB - 3200 MT/s	96 GB - 8000 MT/s	128 GB - 4400 MT/s	128 GB - 5200 MT/s
Program Version:	v0.8.1 (22-ZN4)	v0.8.1 (18-CNL)	v0.8.1 (19-ZN2)	v0.8.1 (19-ZN2)	v0.8.1 (14-BDW)	v0.8.1 (22-ZN4)
Instruction Set:	x64 AVX512-GFNI	x64 AVX512-VBMI	x64 AVX2 + ADX	x64 AVX2 + ADX	x64 AVX2 + ADX	x64 AVX512-GFNI
25,000,000	0.439	0.501	0.588	0.490	0.241	0.312	0.307
50,000,000		1.114	1.257	1.090	0.525	0.679	0.654
100,000,000		2.223	2.685	2.345	1.132	1.517	1.410
250,000,000		6.220	7.251	6.371	3.185	4.157	3.820
500,000,000	13.378	13.573	15.556	13.395	7.065	8.883	8.062
1,000,000,000	29.497	30.415	33.925	29.301	15.901	18.542	17.039
2,500,000,000	83.421	86.119	96.695	82.204	44.888	50.743	46.467
5,000,000,000	181.647	193.718	215.333	181.355	99.566	110.379	101.345
10,000,000,000			473.958	399.012		241.162	220.522
25,000,000,000			1,361.732			680.344	623.493
Credit:	Joel Rufin	Oliver Kruse		Oliver Kruse	曾铮

Processor(s):	Core i7 920	FX-8350	Core i7 4770K	Ryzen 7 1800X	Ryzen 7 3800X
Generation:	Intel Nehalem	AMD Piledriver	Intel Haswell	AMD Zen 1	AMD Zen 2
Cores/Threads:	4/8	8/8	4/8	8/16	8/16
Processor Speed:	3.5 GHz	stock	4.0 GHz	stock	stock
Memory:	12 GB - 1333 MT/s	32 GB - 1600 MT/s	32 GB - 2133 MT/s	64 GB - 2866 MT/s	32 GB - 3600 MT/s
Program Version:	v0.8.1 (08-NHM)	v0.8.1 (11-BD1)	v0.8.1 (13-HSW)	v0.8.1 (17-ZN1)	v0.8.1 (19-ZN2)
Instruction Set:	x64 SSE4.1	x64 FMA4	x64 AVX2	x64 AVX2 + ADX	x64 AVX2 + ADX
25,000,000	7.032	3.677	1.546	1.150	0.654
50,000,000	17.174	7.703	3.259	2.527	1.415
100,000,000	36.164	16.576	6.987	5.555	3.028
250,000,000	105.789	46.597	19.588	15.760	8.404
500,000,000	236.096	103.165	43.197	34.659	18.440
1,000,000,000	531.676	230.780	96.845	78.690	41.097
2,500,000,000		669.594	274.336	220.278	117.788
5,000,000,000		1,460.714	606.605	493.388	266.719
10,000,000,000				1,078.187
25,000,000,000
Credit:					Oliver Kruse

High-End Desktops:

Processor(s):	Core i7 5960X	Threadripper 1950X	Core i9 7900X	Core i9 7940X	Threadripper 3990X	Xeon W7-2495X	Xeon W9-3475X
Generation:	Intel Haswell	AMD Zen 1	Intel Skylake X	Intel Skylake X	AMD Zen 2	Intel Sapphire Rapids	Intel Sapphire Rapids
Cores/Threads:	8/16	16/32	10/20	14/28	64/128	24/48	36/72
Processor Speed:	4.0 GHz	stock	~3.6 GHz (200W PL)	3.6 GHz (AVX512)	2.9 GHz	4.1-4.9 GHz	4.2-4.9 GHz
Memory:	64 GB - 2400 MT/s	64 GB - 2800 MT/s	128 GB - 3000 MT/s	128 GB - 3466 MT/s	~141 GB - 2666 MT/s	64 GB - 6400 MT/s	128 GB - 6400 MT/s
Program Version:	v0.8.1 (13-HSW)	v0.8.1 (17-ZN1)	v0.8.1 (17-SKX)	v0.8.1 (17-SKX)	v0.8.1 (19-ZN2)	v0.8.1 (18-CNL)	v0.8.3 (18-CNL)
Instruction Set:	x64 AVX2	x64 AVX2 + ADX	x64 AVX512-DQ	x64 AVX512-DQ	x64 AVX2 + ADX	x64 AVX512-VBMI	x64 AVX512-VBMI
25,000,000	0.807	0.756	0.522	0.404	0.584	0.170	0.201
50,000,000	1.743	1.579	1.028	0.721	1.181	0.340	0.321
100,000,000	3.647	3.273	2.048	1.451	2.409	0.726	0.586
250,000,000	10.088	8.990	5.752	4.056	5.724	2.068	1.413
500,000,000	22.075	19.604	12.830	9.017	10.881	4.588	2.627
1,000,000,000	49.232	43.014	28.906	20.518	21.496	10.190	5.924
2,500,000,000	139.404	121.645	82.764	60.636	58.009	28.881	16.345
5,000,000,000	311.388	271.983	186.233	137.906	126.513	64.158	36.139
10,000,000,000	669.736	613.450	401.820	302.121	274.050	124.826	78.816
25,000,000,000			1,125.775	843.498	768.212		225.482
Credit:		Oliver Kruse			Paul Underwood	曾铮

Multi-Processor Workstation/Servers:

Due to high core counts and the effect of NUMA (Non-Uniform Memory Access), performance on multi-processor systems is extremely sensitive to various settings. Therefore, these benchmarks may not be entirely representative of what the hardware is capable of.

Processor(s):	Xeon Platinum 8375C (AWS x2iedn.32xlarge)	Xeon Platinum 8488C (AWS m7i.48xlarge)	Epyc 9R14 (AWS m7a.48xlarge)	Epyc 9R14 (AWS hpc7a.96xlarge)	Epyc 9754
Generation:	Intel Sapphire Rapids	Intel Sapphire Rapids	AMD Genoa		AMD Bergamo
Cores/Threads:	64/128	96/192	192/192		128/256	128/128
Processor Speed:	2.9 GHz	2.4 GHz	2.6 GHz		2.25 - 3.1 GHz
Memory:	4 TB	744 GB	740 GB		768 GB - 4800 MT/s
Program Version:	v0.8.1 (18-CNL)	v0.8.1 (18-CNL)	v0.8.1 (22-ZN4)		v0.8.1 (22-ZN4)
Instruction Set:	x64 AVX512-VBMI	x64 AVX512-VBMI	x64 AVX512-GFNI		x64 AVX512-GFNI
25,000,000	0.250	0.163	0.216	0.213	0.245	0.229
50,000,000	0.454	0.289	0.285	0.279	0.350	0.433
100,000,000	0.844	0.531	0.642	0.635	0.853	0.876
250,000,000	1.976	1.288	1.776	1.716	2.224	2.133
500,000,000	3.794	2.499	3.728	3.621	4.186	3.850
1,000,000,000	7.650	5.149	6.547	6.265	7.063	6.495
2,500,000,000	20.425	13.633	13.554	12.500	15.338	14.477
5,000,000,000	45.675	29.655	25.334	22.377	29.072	28.133
10,000,000,000	101.468	64.026	51.134	44.059	58.797	59.007
25,000,000,000	297.622	182.920	140.286	120.282	156.797	164.281
50,000,000,000	678.016	410.842	321.970	275.297	350.391	368.548
100,000,000,000	1,549.991	943.182	771.266	672.558	829.957	853.717
250,000,000,000	4,488.317
500,000,000,000	9,685.971
Credit:	Greg Hogan				Tim Wesley

Processor(s):	Xeon Platinum 8124M	Xeon Gold 6148	Xeon Platinum 8175M	Xeon Platinum 8275CL	Epyc 7742	Epyc 7B12	Epyc 7742
Generation:	Intel Skylake Purley	Intel Skylake Purley	Intel Skylake Purley	Intel Cascade Lake	AMD Rome	AMD Rome	AMD Rome
Sockets/Cores/Threads:	2/36/72	2/40/40	2/48/96	2/48/96	2/128/256	2/112/224	2/128/256
Processor Speed:	3.0 GHz	2.4 GHz	2.5 GHz	3.0 GHz		2.25 GHz	2.25 GHz
Memory:	137 GB - ??	188 GB - ??	~756 GB - ??	192 GB	~504 GB	~882 GB	2 TB
Program Version:	v0.7.5 (17-SKX)	v0.7.6 (17-SKX)	v0.7.6 (17-SKX)	v0.7.8 (17-SKX)	v0.7.7 (17-ZN1)	v0.7.8 (19-ZN2)	v0.7.8 (19-ZN2)
Instruction Set:	x64 AVX512-DQ	x64 AVX512-DQ	x64 AVX512-DQ	x64 AVX512-DQ	x64 AVX2 + ADX	x64 AVX2 + ADX	x64 AVX2 + ADX
25,000,000	0.540	0.329	0.294	0.283	0.534	0.439	0.513
50,000,000	0.981	0.683	0.617	0.544	1.027	0.838	0.920
100,000,000	1.905	1.456	1.305	1.169	2.298	1.796	1.887
250,000,000	5.085	3.737	3.591	3.125	5.854	4.509	4.650
500,000,000	10.372	7.750	7.293	6.309	10.502	8.196	8.066
1,000,000,000	21.217	16.550	15.041	13.042	17.836	14.252	13.246
2,500,000,000	55.701	45.693	39.329	34.028	35.485	30.592	27.011
5,000,000,000	118.151	99.078	83.601	71.777	62.432	58.405	49.940
10,000,000,000	247.928	212.984	176.695	153.169	115.543	116.900	98.156
25,000,000,000		599.653	491.988	425.442	307.995	314.907	258.081
50,000,000,000			1,081.181		690.662	741.633	598.716
100,000,000,000						1,715.123	1,370.714
250,000,000,000							3,872.397
Credit:	Jacob Coleman	Oliver Kruse	newalex	Xinyu Miao	Carsten Spille	Greg Hogan	Song Pengei

Processor(s):	Xeon E5-2683 v3	Xeon E7-8880 v3	Xeon E5-2687W v4	Xeon E5-2686 v4	Xeon E5-2696 v4	Epyc 7601	Xeon Gold 6130F
Generation:	Intel Haswell	Intel Haswell	Intel Broadwell	Intel Broadwell	Intel Broadwell	AMD Naples	Intel Skylake Purley
Sockets/Cores/Threads:	2/28/56	4/64/128	2/24/48	2/36/72	2/44/88	2/64/128	2/32/64
Processor Speed:	2.03 GHz	2.3 GHz	3.0 GHz	2.3 GHz	2.2 GHz	2.2 GHz	2.1 GHz
Memory:	128 GB - ???	2 TB - ???	64 GB	504 GB - ???	768 GB - ???	256 GB - ??	256 GB - ??
Program Version:	v0.6.9 (13-HSW)	v0.7.1 (13-HSW)	v0.7.6 (14-BDW)	v0.7.7 (14-BDW)	v0.7.1 (14-BDW)	v0.7.3 (17-ZN1)	v0.7.3 (17-SKX)
Instruction Set:	x64 AVX2	x64 AVX2	x64 AVX2 + ADX	x64 AVX2 + ADX	x64 AVX2 + ADX	x64 AVX2 + ADX	x64 AVX512-DQ
25,000,000	0.907	1.176	0.490	0.494	0.715	2.459	1.150
50,000,000	1.745	2.321	1.072	0.982	1.344	4.347	1.883
100,000,000	3.317	4.217	2.303	2.193	2.673	6.996	3.341
250,000,000	8.339	8.781	6.196	6.044	6.853	14.258	7.731
500,000,000	17.708	15.879	13.046	12.582	14.538	24.930	15.346
1,000,000,000	37.311	32.078	27.763	26.852	31.260	47.837	31.301
2,500,000,000	102.131	78.251	76.202	73.596	84.271	111.139	82.871
5,000,000,000	218.917	164.157	165.046	160.094	192.889	228.252	179.488
10,000,000,000	471.802	346.307	356.487	346.305	417.322	482.777	387.530
25,000,000,000	1,511.852	957.966	1,006.131	980.784	1,186.881	1,184.144	1,063.850
50,000,000,000		2,096.169	2,202.558	2,156.854	2,601.476
100,000,000,000		4,442.742			6,037.704
250,000,000,000		17,428.450
Credit:	Shigeru Kondo	Jacob Coleman	Cameron Giesbrecht	newalex	"yoyo"	Dave Graham

Fastest Times:

The full chart of rankings for each size can be found here:

These fastest times may include unreleased betas.

Got a faster time? Let me know: a-yee@u.northwestern.edu

Note that I usually do not respond to these emails. I simply put them into the charts which I update periodically (typically within 2 weeks).

Performance Tips:

Decimal Digits of Pi - Times in Seconds Core i9 7940X @ 3.7 GHz AVX512
Memory Frequency:	2666 MT/s	3466 MT/s
25,000,000	0.839	0.758
50,000,000	1.424	1.338
100,000,000	2.701	2.425
250,000,000	6.489	5.877
500,000,000	13.307	11.917
1,000,000,000	27.913	24.915
2,500,000,000	76.837	68.322
5,000,000,000	168.058	148.737
10,000,000,000	365.047	322.115
25,000,000,000	1,037.527	916.039

High core count Skylake X processors are known to be heavily bottlenecked by memory bandwidth.

Memory Bandwidth:

Because of the memory-intensive nature of computing Pi and other constants, y-cruncher needs a lot of memory bandwidth to perform well. In fact, the program has been noticeably memory bound on nearly all high-end desktops since 2012 as well as the majority of multi-socket systems since at least 2006.

Recommendations:

Make sure all memory channels are populated. This is by far the most important since bandwidth scales almost linearly with the number of channels.
Run your memory at as high a frequency as possible to maximize bandwidth.
Memory timings are usually less important. Long memory latencies are hidden away fairly well by Hyperthreading.
On Skylake X processors, L3 cache bandwidth is also a bottleneck. So overclock the cache as much as possible.
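To put rough numbers on the first two recommendations, theoretical peak DRAM bandwidth can be estimated as channels x transfer rate x bytes per transfer. The sketch below assumes a 64-bit (8-byte) channel, as on DDR3 and DDR4 DIMMs; the function name is illustrative, and real sustained bandwidth will be lower than this ceiling.

```cpp
#include <cassert>

// Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).
// Assumption: each channel moves 8 bytes per transfer (64-bit DDR3/DDR4 channel).
double peak_bandwidth_gbs(int channels, double megatransfers_per_sec){
    assert(channels > 0 && megatransfers_per_sec > 0.0);
    const double bytes_per_transfer = 8.0;
    return channels * megatransfers_per_sec * bytes_per_transfer / 1000.0;
}
```

Assuming four populated channels, this gives about 85.3 GB/s at 2666 MT/s and about 110.9 GB/s at 3466 MT/s. Dropping to two channels at 3466 MT/s halves the ceiling to about 55.5 GB/s, which is why populating every channel comes first.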

Don't be surprised if y-cruncher exposes instabilities that other applications and stress-tests do not. y-cruncher is unusual in that it simultaneously places a heavy load on both the CPU and the entire memory subsystem.

Parallel Performance:

y-cruncher has a lot of settings for tuning parallel performance. By default, it makes a best effort to analyze the hardware and pick the best settings. But because of the virtually unlimited combinations of processor topologies, y-cruncher cannot always pick the optimal settings on its own. So sometimes the best performance can only be achieved with manual settings.

Try both the Push Pool and Cilk Plus frameworks. While the Push Pool is faster in most cases, Cilk Plus may be better for extremely small computations as well as on machines with many (> 64) cores.*
Experiment with larger task decomposition sizes. This may alleviate problems with load-imbalance.*
On Windows, if the system has more than 64 logical cores, make sure node-interleaving is disabled in the BIOS. Otherwise, the processor groups will be imbalanced, which in turn causes load imbalance.

*These are advanced settings that cannot be changed if you're using the benchmark option in the console UI. To change them, you will need to either run benchmark mode from the command line or use the custom compute menu.

Load imbalance is a fairly common problem in y-cruncher. The usual causes are:

The number of logical cores is not a power-of-two.
The cores are not homogeneous. Common reasons include:
- The cores are clocked at different speeds.
- The cores have access to different amounts of memory bandwidth due to an imbalanced NUMA topology.
- The cores are different generation cores hidden behind a virtual machine.
CPU-intensive background processes are interfering with y-cruncher's ability to use all the hardware. This applies to all forms of system jitter.
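The first cause, and the reason larger task decomposition can help, can be seen in a toy model. This is only a sketch under simplifying assumptions of my own: the work splits into equal-sized independent tasks, the task count is a power of two, and the tasks run in lock-step rounds. It is not a description of y-cruncher's actual scheduler.

```python
import math


def parallel_efficiency(num_tasks, num_cores):
    """Fraction of core time doing useful work in the toy model.

    Equal-sized tasks run in rounds of up to `num_cores` tasks each,
    so the last round may leave some cores idle.
    """
    rounds = math.ceil(num_tasks / num_cores)
    return num_tasks / (rounds * num_cores)


if __name__ == "__main__":
    # 8 tasks fit 8 cores exactly, but leave 6 cores poorly balanced.
    for tasks, cores in [(8, 8), (8, 6), (64, 6)]:
        eff = parallel_efficiency(tasks, cores)
        print(f"{tasks:3d} tasks on {cores} cores: {eff:.1%} efficient")
```

In this model, 8 tasks on 6 cores need two rounds and leave a third of the core time idle. Splitting the same work into 64 smaller tasks wastes only the tail of the last round.
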

Large Pages:

Large pages did not matter in the past, but they do now in the post-Spectre/Meltdown world. Mitigations for the Meltdown vulnerability can cause a noticeable performance drop in y-cruncher (up to 5% has been observed). It turns out that turning on large pages can mitigate the penalty for this mitigation. (pun intended)

Refer to the memory allocation guide on how to turn on large pages.
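On Linux, one quick way to see whether the OS has any huge pages reserved is to read /proc/meminfo. The sketch below only inspects the system and changes nothing. The helper names are hypothetical, and whether y-cruncher can use these pages depends on the steps in the memory allocation guide, which this does not replace.

```python
from pathlib import Path


def parse_meminfo(text):
    """Map /proc/meminfo field names to their integer values."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def huge_page_summary(text):
    """Pick out the explicit huge page counters from /proc/meminfo text."""
    info = parse_meminfo(text)
    return {
        "reserved_pages": info.get("HugePages_Total", 0),
        "free_pages": info.get("HugePages_Free", 0),
        "page_size_kb": info.get("Hugepagesize", 0),
    }


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        print(huge_page_summary(meminfo.read_text()))
    else:
        print("No /proc/meminfo here; this check only applies to Linux.")
```
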

Swap Mode:

This is probably one of the most complicated features in y-cruncher.

Read the guide so you know how to use it.
Depending on the CPU capability of your system, chances are you will either need multiple NVMe SSDs or many hard drives to avoid bottlenecking on disk I/O.
Don't use hardware or software RAID. y-cruncher usually does a better job if you let it manage each drive separately.
Don't use SSDs if you care about their lifespan. y-cruncher can and will destroy SSDs if you sustain it long enough.

Known Issues:

Everything in this section is in the process of being re-verified and moved to: https://github.com/Mysticial/y-cruncher/issues

Performance Issues:

Swap mode computations on the latest Ubuntu (15.10), and possibly on everything else with the same kernel version, have very poor performance. This is because the OS does excessive and unnecessary disk swapping to the pagefile. The solution is to disable the swap file so that no paging is possible. It may also suffice to set the "swappiness" value to zero. y-cruncher will also attempt to lock pages in memory to prevent the OS from shooting itself in the foot with paging.
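Before a long swap mode run on Linux, it may be worth checking the current paging setup. The sketch below reads the standard /proc/sys/vm/swappiness and /proc/swaps files and reports what it finds. The function names are hypothetical and nothing is modified. Actually changing the settings is normally done as root, for example with `swapoff -a` or `sysctl vm.swappiness=0`.

```python
from pathlib import Path


def active_swap_devices(proc_swaps_text):
    """Return the swap files/partitions listed in /proc/swaps text."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header; the rest are active swap areas.
    return [line.split()[0] for line in lines[1:] if line.strip()]


def paging_report(swappiness_text, proc_swaps_text):
    """Summarize the swappiness value and any active swap areas."""
    return {
        "swappiness": int(swappiness_text.strip()),
        "swap_devices": active_swap_devices(proc_swaps_text),
    }


if __name__ == "__main__":
    swappiness = Path("/proc/sys/vm/swappiness")
    swaps = Path("/proc/swaps")
    if swappiness.exists() and swaps.exists():
        report = paging_report(swappiness.read_text(), swaps.read_text())
        print(report)
        if report["swap_devices"]:
            print("Swap is active; the OS may page during swap mode.")
    else:
        print("These /proc files were not found; this check is Linux-only.")
```
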

Algorithms and Developments:

FAQ:

Pi and other Constants:

Program Usage:

Hardware and Overclocking:

Academia:

Programming:

Other:

What about support for other platforms? Mac, ARM, etc...

Links:

Here are some interesting sites dedicated to the computation of Pi and other constants:

Questions or Comments

Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.

You can also find me on Twitter as @Mysticial.

OS	Download Link	Size
Windows	y-cruncher v0.8.4.9538a.zip	35.0 MB
Linux (Static)	y-cruncher v0.8.4.9538-static.tar.xz	26.7 MB
Linux (Dynamic)	y-cruncher v0.8.4.9538-dynamic.tar.xz	19.0 MB

Sample Screenshot: 1 trillion digits of Pi
