Processor-Specific Optimizations

By Alexander Yee

 

(Last updated: August 31, 2016)

 

 

When it comes to breaking size records, micro-optimizations and processor-specific optimizations matter less, because large computations are almost always bottlenecked by disk access. The 12.1 and 13.3 trillion digit computations of Pi both had a CPU utilization of less than 40% due to this disk bottleneck.

 

Nevertheless, one of the goals of y-cruncher is to be as fast as possible for the purposes of benchmarking and stress-testing.

 

y-cruncher does processor-specific optimizations in two ways:

  1. Use of processor-specific instruction set extensions.
  2. Tuning parameters that are optimized specifically for a particular environment.

Processor-specific instructions are the obvious one. Most of these are vector instructions (SSE/AVX), but there are others as well.

The second is processor-specific tuning. This is more subtle and involves things that are architectural in nature.

 

Targeted Processor Lines

 

As of v0.7.1, y-cruncher targets the following processor lines with specially optimized binaries:

----------------------------------------------------------------------
Binary:                        x64 AVX512-IFMA ~ ??
Target Processor(s):           Intel Cannonlake
Instruction Set Requirements:  x64, ABM, BMI1, BMI2, ADX,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA3, AVX2,
                               AVX512-(F/CD/VL/BW/DQ/IFMA/VBMI)
Notes:                         Not enabled yet. Both this binary and the
                               AVX512-DQ binary are fully working and tested
                               under emulation. But without the hardware,
                               there's nothing to do other than to wait.
----------------------------------------------------------------------
Binary:                        x64 AVX512-DQ ~ ??
Target Processor(s):           Intel Skylake Purley
Instruction Set Requirements:  x64, ABM, BMI1, BMI2, ADX,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA3, AVX2,
                               AVX512-(F/CD/VL/BW/DQ)
Notes:                         Not enabled yet. (See the note above.)
----------------------------------------------------------------------
Binary:                        x64 AVX512-CD ~ ??
Target Processor(s):           Intel Knights Landing
Instruction Set Requirements:  x64, ABM, BMI1, BMI2, ADX,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA3, AVX2,
                               AVX512-(F/CD)
Notes:                         This one is more uncertain. Knights Landing
                               has such a drastically different architecture
                               from normal desktop processors that the AVX512
                               might be irrelevant.
----------------------------------------------------------------------
Binary:                        x64 ADX ~ Kurumi
Target Processor(s):           Intel Skylake
Instruction Set Requirements:  x64, ABM, BMI1, BMI2, ADX,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA3, AVX2
Notes:                         Will also run on Intel Broadwell processors.
----------------------------------------------------------------------
Binary:                        x64 AVX2 ~ Airi
Target Processor(s):           Intel Haswell
Instruction Set Requirements:  x64, ABM, BMI1, BMI2,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA3, AVX2
Notes:                         Not all Haswell processors support AVX.
----------------------------------------------------------------------
Binary:                        x64 XOP ~ Miyu
Target Processor(s):           AMD Bulldozer
Instruction Set Requirements:  x64, ABM,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA4, XOP
Notes:                         -
----------------------------------------------------------------------
Binary:                        x64 AVX ~ Hina
Target Processor(s):           Intel Sandy Bridge
Instruction Set Requirements:  x64, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX
Notes:                         Not all Sandy Bridge processors support AVX.
----------------------------------------------------------------------
Binary:                        x64 SSE4.1 ~ Ushio
Target Processor(s):           Intel Nehalem
Instruction Set Requirements:  x64, SSE, SSE2, SSE3, SSSE3, SSE4.1
Notes:                         -
----------------------------------------------------------------------
Binary:                        x64 SSE4.1 ~ Nagisa
Target Processor(s):           Intel Core 2 Penryn
Instruction Set Requirements:  x64, SSE, SSE2, SSE3, SSSE3, SSE4.1
Notes:                         Disabled since v0.6.5. But still maintained.
----------------------------------------------------------------------
Binary:                        x64 SSE3 ~ Kasumi
Target Processor(s):           -
Instruction Set Requirements:  x64, SSE, SSE2, SSE3
Notes:                         Originally targeted AMD K10. But years without
                               access to a suitable test machine have left the
                               tuning parameters uselessly out-of-date. The
                               binary is now tuned for Core 2 Kentsfield.
----------------------------------------------------------------------
Binary:                        x64
Target Processor(s):           Original AMD x64
Instruction Set Requirements:  x64, SSE, SSE2
Notes:                         Discontinued as of v0.4.1.
----------------------------------------------------------------------
Binary:                        x86 SSE3
Target Processor(s):           -
Instruction Set Requirements:  SSE, SSE2, SSE3
Notes:                         -
----------------------------------------------------------------------
Binary:                        x86
Target Processor(s):           -
Instruction Set Requirements:  none
Notes:                         Disabled in v0.6.1. Re-enabled in v0.7.1.*
----------------------------------------------------------------------

Each binary has two parts in its name. The first part (e.g. "x64 AVX") is the required instruction set architecture. The second part (e.g. "Hina") is the name of the computer that the binary was tuned on. Since there are a lot of computers, naming them is necessary to tell them apart on the network.

 

The names themselves are generally taken from various Japanese anime characters. Once a name is assigned, it sticks forever, even if the machine is replaced. For example, "Hina" was destroyed in an accident and was replaced with "Haruka", an Ivy Bridge laptop. But the old name remains.

 

These binaries reside in the "Binaries" folder and no attempt is made to hide them. Users are free and encouraged to run the binaries directly to see the effects of the various instruction sets. For everyone else who wants to keep things simple, the "y-cruncher" binary is a dispatcher that auto-selects the best binary to run.

 

When the user runs the dispatcher, it walks down the list of binaries in the order they are listed above and picks the first one that will run.
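The "walk down the list and pick the first one that runs" logic can be sketched like this. This is a hypothetical illustration using GCC/Clang's CPU feature builtins, not y-cruncher's actual dispatcher code, and the feature list is abbreviated (the AVX512 and ADX tiers are omitted):

```cpp
// Simplified sketch of binary dispatch by CPU feature detection.
// Real dispatch logic checks more features and launches a separate binary.
const char* pick_binary() {
    __builtin_cpu_init();                  // populate the CPU feature cache
    if (__builtin_cpu_supports("avx2"))   return "x64 AVX2 ~ Airi";
    if (__builtin_cpu_supports("xop"))    return "x64 XOP ~ Miyu";
    if (__builtin_cpu_supports("avx"))    return "x64 AVX ~ Hina";
    if (__builtin_cpu_supports("sse4.1")) return "x64 SSE4.1 ~ Ushio";
    if (__builtin_cpu_supports("sse3"))   return "x64 SSE3 ~ Kasumi";
    return "x86";                          // no-requirement fallback
}
```

Each check subsumes the ones below it, so the first match is the most capable binary the processor can run.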

The following exceptions are applied:

The dispatcher will also give warnings for the following situations:

 

*The x86 binary was disabled for the entire v0.6.x series for a number of technical reasons related to performance. y-cruncher was almost completely rewritten between v0.5.5 and v0.6.1. During that rewrite, a lot of old code was tossed out - including the original FFT that ran very fast on the x87 FPU. So starting from v0.6.1, the x86 binary used the newer vector-scalable FFT that was designed for SSE3 and AVX. While this vector FFT compiled and ran fine with x87 instructions, it was ultimately slower than the original FFT. In the end, the overall speed of the x86 binary regressed from v0.5.5 to v0.6.1. So the decision was made to simply disable it.

 

Fast forward to v0.7.1, and the cumulative effect of three years' worth of optimizations has finally overcome the v0.5.5 -> v0.6.1 regression. So the x86 binary has been re-enabled. However, the new x86 binary isn't the same as before. Like the rest of the binaries, it requires Windows Vista at a minimum, so don't expect to run it on any processors that are very old. Its main purpose is to let users compare the effects of the different instruction set extensions.

 

 

Instruction Set Extensions

 

As of v0.7.1, y-cruncher (explicitly) uses the following instruction sets:

By "explicitly", we mean done using either compiler intrinsics or inline assembly. The actual instruction set requirement will be broader since the compiler itself will often generate other instructions that are implied by targeting a specific processor. (y-cruncher doesn't explicitly use ABM or BMI1. But the compiler may generate them anyway when compiling for Haswell.)
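As a small illustration of what "explicit" use looks like (an assumed example, not y-cruncher code), the intrinsic below maps directly to the SSE2 PADDQ instruction, while the compiler remains free to emit whatever implied instructions it wants for the surrounding loads, stores, and addressing:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

// Add two pairs of 64-bit integers with one explicit SSE2 instruction.
void add2x64(const int64_t* a, const int64_t* b, int64_t* out) {
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    _mm_storeu_si128((__m128i*)out, _mm_add_epi64(va, vb));  // PADDQ
}
```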

 

 

Miscellaneous Instructions:

 

The 64-bit -> 128-bit multiply and the add-with-carry instructions are all extremely useful for large multiplication. Since no standard C/C++ construct will generate them, they must be generated using either compiler intrinsics or inline assembly.

 

Binaries that target Haswell or later will use the MULX instruction. Binaries that target Broadwell or later also use the ADCX and ADOX instructions.
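For illustration, here is roughly what these two primitives look like with intrinsics (GCC/Clang syntax; MSVC spells the multiply `_umul128`). This is a generic sketch, not y-cruncher's code. When compiling for the appropriate targets, the compiler can lower the multiply to MULX and the carry chain to ADC or ADCX/ADOX:

```cpp
#include <immintrin.h>   // _addcarry_u64

using u64 = unsigned long long;

// 64-bit x 64-bit -> 128-bit multiply: return the low word, write the high.
static inline u64 mul128(u64 a, u64 b, u64* hi) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (u64)(p >> 64);
    return (u64)p;
}

// Add two 3-limb integers with a carry chain - the pattern that the
// ADC/ADCX/ADOX instructions accelerate.
static void add3(const u64* a, const u64* b, u64* out) {
    unsigned char c = 0;
    c = _addcarry_u64(c, a[0], b[0], &out[0]);
    c = _addcarry_u64(c, a[1], b[1], &out[1]);
    (void)_addcarry_u64(c, a[2], b[2], &out[2]);
}
```

The ADCX/ADOX pair matters because it allows two independent carry chains to run in parallel, which a single ADC chain cannot do.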

 

 

Vector Instructions:

 

Arbitrary precision arithmetic has historically been difficult to vectorize due to carry-propagation. But at large enough sizes, it's an entirely different game since most of the larger algorithms are either directly vectorizable, or can be made vectorizable with the right hacks.

We won't try to enumerate all the places where y-cruncher explicitly vectorizes. But there are a lot of them, and they are all done manually with intrinsics and occasionally a bit of inline assembly. For the most part, everything performance-critical that can be vectorized has been vectorized manually. Very little is left to the compiler.

 

With the large number of target processors with varying sets of vector instructions, it may seem tedious to manually vectorize all of it. But there are plenty of tricks to abstract out most of the work without sacrificing performance.

 

In other words, y-cruncher does not have 10 different implementations for everything. One well-written implementation will suffice for all vector sizes from none (scalar) all the way to AVX512 and beyond. Furthermore, it will also handle all the various flavors of instructions within each vector size.
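One common way to achieve this (an illustrative sketch, not y-cruncher's actual abstraction layer) is to hide each vector width behind a tiny uniform wrapper type, so a single templated algorithm covers scalar and SIMD alike:

```cpp
#include <emmintrin.h>   // SSE2
#include <cstddef>

struct ScalarWord {                       // scalar fallback: width = 1 double
    static constexpr size_t width = 1;
    double v;
    static ScalarWord load(const double* p) { return { *p }; }
    void store(double* p) const { *p = v; }
    friend ScalarWord operator+(ScalarWord a, ScalarWord b) { return { a.v + b.v }; }
};

struct SseWord {                          // SSE2 version: width = 2 doubles
    static constexpr size_t width = 2;
    __m128d v;
    static SseWord load(const double* p) { return { _mm_loadu_pd(p) }; }
    void store(double* p) const { _mm_storeu_pd(p, v); }
    friend SseWord operator+(SseWord a, SseWord b) { return { _mm_add_pd(a.v, b.v) }; }
};

// One implementation, instantiated once per target.
// (For simplicity, n is assumed to be divisible by Word::width.)
template <class Word>
void add_arrays(const double* a, const double* b, double* out, size_t n) {
    for (size_t i = 0; i < n; i += Word::width)
        (Word::load(a + i) + Word::load(b + i)).store(out + i);
}
```

Adding an AVX or AVX512 target then only requires writing another small wrapper struct; the algorithms themselves don't change.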

 

Unfortunately, these "vector-scalable" approaches do not extend all the way to GPUs since those require a completely different programming paradigm.

 

 

Tuning Parameters

 

Tuning parameters cover everything else that isn't a processor-specific instruction. These are all things that affect performance, but do not affect the ability to run on other processors.

 

A partial list of these include:

While some parameters are tuned manually, most are done automatically using a classic superoptimizer.

 

For each given task, the superoptimizer benchmarks all available algorithms/configurations and puts the best one into a lookup table. This generates a table of "optimal" configurations. Due to the exhaustive nature of superoptimization, it also serves as a very thorough integration testing platform.
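The core loop of such a superoptimizer can be sketched as follows. This is a toy illustration of the idea (all names here are hypothetical; the real tuner benchmarks far more carefully and over many task sizes):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Kernel = void (*)(std::vector<double>&);

// Two interchangeable candidate implementations of the same task.
static void kernel_a(std::vector<double>& v) { for (auto& x : v) x += 1.0; }
static void kernel_b(std::vector<double>& v) {
    for (size_t i = 0; i < v.size(); i++) v[i] += 1.0;
}

// Benchmark every candidate on real data and return the index of the fastest.
// The caller records this index into the tuning lookup table.
size_t superoptimize(const std::vector<Kernel>& candidates, size_t size) {
    size_t best = 0;
    double best_time = 1e300;
    for (size_t k = 0; k < candidates.size(); k++) {
        std::vector<double> data(size, 0.0);
        auto t0 = std::chrono::steady_clock::now();
        candidates[k](data);
        auto t1 = std::chrono::steady_clock::now();
        double t = std::chrono::duration<double>(t1 - t0).count();
        if (t < best_time) { best_time = t; best = k; }
    }
    return best;
}
```

Because every candidate actually runs to completion during tuning, the process doubles as an exhaustive integration test of all the code paths.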

 

Since superoptimizer tuning is done with real benchmarks on an actual system, it takes everything into account - including things that are specific to the system rather than the processor line. As a result, the tuning fundamentally biases the results toward the specific system it is run on. For this reason, superoptimizer tuning is preferably done on properly configured high-end desktop systems with as much memory as possible.

 

This tuning can have a significant effect on performance. It is also the reason why the SSE4.1 binaries (which are tuned for Intel processors) run slower than the SSE3 binary on AMD Bulldozer processors.

 

y-cruncher has always had some form of superoptimization, but it wasn't taken seriously until v0.4.3. Back in those early days, it was often possible to squeeze out more than 10% performance over an untuned (or improperly tuned) configuration. Recent versions of y-cruncher don't gain nearly as much.

 

 

Why isn't the superoptimizer exposed to the user?

 

It's not that simple:

With rare exceptions, the superoptimizer produces nearly identical results for processors within the same generation. So it's sufficient to go "one size fits all" within each processor line. And of course, it keeps things a lot simpler as well.

 

 

 

Performance Matrix

 

All times in seconds.

 

 

Version v0.7.1:

y-cruncher v0.7.1.9465 - Pi - 250 million digits
(A dash marks binaries the processor cannot run.)

Binary:                                          x86       x86       x64       x64       x64       x64       x64       x64
                                                 -         SSE3      SSE3      SSE4.1    AVX       XOP       AVX2      ADX
Tuned on:                                        -         -         Kasumi    Ushio     Hina      Miyu      Airi      Kurumi

Core 2 Quad Q6600   (Intel Core, 2.4 GHz)        308.968   222.807   156.763   -         -         -         -         -
Core i7 920         (Intel Nehalem, 3.5 GHz)     168.149   113.313    83.244    80.000   -         -         -         -
Core i7 3630QM      (Intel Ivy Bridge, 3.2 GHz)  153.950    98.872    76.648    73.044    56.130   -         -         -
FX-8350             (AMD Piledriver, 4.0 GHz)    181.973   105.597    70.267    69.988    72.020    54.761   -         -
Core i7 4770K       (Intel Haswell, 4.0 GHz)     118.395    73.312    58.699    55.169    46.588   -          24.227   -
Core i7 6820HK      (Intel Skylake, 3.2 GHz)     138.380    83.017    67.255    62.939    48.250   -          25.059    24.877

y-cruncher v0.7.1.9465 - Pi - 1 billion digits (x64 binaries only)

Binary:                                          SSE3      SSE4.1    AVX       XOP       AVX2      ADX
Tuned on:                                        Kasumi    Ushio     Hina      Miyu      Airi      Kurumi

Core 2 Quad Q6600   (Intel Core, 2.4 GHz)        803.675   -         -         -         -         -
Core i7 920         (Intel Nehalem, 3.5 GHz)     424.546   406.073   -         -         -         -
Core i7 3630QM      (Intel Ivy Bridge, 3.2 GHz)  404.973   393.676   313.130   -         -         -
FX-8350             (AMD Piledriver, 4.0 GHz)    353.717   351.583   376.275   266.824   -         -
Core i7 4770K       (Intel Haswell, 4.0 GHz)     297.314   279.921   239.519   -         119.836   -
Core i7 6820HK      (Intel Skylake, 3.2 GHz)     338.069   319.290   248.265   -         122.021   121.586

 

 

Version v0.6.9:

y-cruncher v0.6.9.9462 - Pi - 250 million digits
(A dash marks binaries the processor cannot run.)

Binary:                                          x86       x86       x64       x64       x64       x64       x64       x64
                                                 -         SSE3      SSE3      SSE4.1    SSE4.1    AVX       XOP       AVX2
Tuned on:                                        -         -         Kasumi    Nagisa    Ushio     Hina      Miyu      Airi

Core 2 Quad Q6600   (Intel Core, 2.4 GHz)        420.987   242.887   174.021   -         -         -         -         -
2x Xeon X5482       (Intel Penryn, 3.2 GHz)      179.118   109.439    72.111    72.011    71.582   -         -         -
Core i7 920         (Intel Nehalem, 3.5 GHz)     177.332   116.777    88.065    84.550    85.637   -         -         -
Core i7 3630QM      (Intel Ivy Bridge, 3.2 GHz)  175.253   112.748    85.591    89.339    81.713    72.034   -         -
FX-8350             (AMD Piledriver, 4.0 GHz)    217.327   113.313    77.238    79.227    77.273   105.003    57.478   -
Core i7 4770K       (Intel Haswell, 4.0 GHz)     159.708    92.977    66.082    60.625    60.541    46.573   -          26.202
Core i7 6820HK      (Intel Skylake, 3.2 GHz)     165.881    97.104    69.067    68.068    68.136    50.341   -          26.248

y-cruncher v0.6.9.9462 - Pi - 1 billion digits (x64 binaries only)

Binary:                                          SSE3      SSE4.1    SSE4.1    AVX       XOP       AVX2
Tuned on:                                        Kasumi    Nagisa    Ushio     Hina      Miyu      Airi

Core 2 Quad Q6600   (Intel Core, 2.4 GHz)        880.448   -         -         -         -         -
2x Xeon X5482       (Intel Penryn, 3.2 GHz)      352.635   336.713   347.962   -         -         -
Core i7 920         (Intel Nehalem, 3.5 GHz)     436.454   412.397   424.928   -         -         -
Core i7 3630QM      (Intel Ivy Bridge, 3.2 GHz)  413.265   399.341   407.710   350.849   -         -
FX-8350             (AMD Piledriver, 4.0 GHz)    377.213   369.352   374.660   506.823   281.864   -
Core i7 4770K       (Intel Haswell, 4.0 GHz)     311.231   295.216   300.480   231.361   -         129.71
Core i7 6820HK      (Intel Skylake, 3.2 GHz)     343.253   335.229   342.091   253.056   -         130.53