Processor-Specific Optimizations

By Alexander Yee

 

(Last updated: August 31, 2016)

 

 

When it comes to breaking size records, micro-optimizations and processor-specific optimizations matter less, because large computations are almost always bottlenecked by disk access. The 12.1 and 13.3 trillion digit computations of Pi both had a CPU utilization of less than 40% due to this disk bottleneck.

 

Nevertheless, one of the goals of y-cruncher is to be as fast as possible for the purposes of benchmarking and stress-testing.

 

y-cruncher does processor-specific optimizations in two ways:

  1. Use of processor-specific instruction set extensions.
  2. Tuning parameters that are optimized specifically for a particular environment.

Processor-specific instructions are the obvious one. Most of these are vector instructions (SSE/AVX), but there are others as well.

The second is processor-specific tuning. This is more subtle and involves things that are architectural in nature.

 

Targeted Processor Lines

 

As of v0.7.1, y-cruncher targets the following processor lines with specially optimized binaries:

----------------------------------------------------------------------
Binary:                        x64 AVX512-IFMA ~ ??
Target Processor(s):           Intel Cannonlake
Instruction Set Requirements:  x64, ABM, BMI1, BMI2, ADX,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA3, AVX2,
                               AVX512-(F/CD/VL/BW/DQ/IFMA/VBMI)
Notes:                         Not enabled yet. Both this binary and the
                               AVX512-DQ binary are fully working and tested
                               under emulation. But without the hardware,
                               there's nothing to do other than to wait.
----------------------------------------------------------------------
Binary:                        x64 AVX512-DQ ~ ??
Target Processor(s):           Intel Skylake Purley
Instruction Set Requirements:  x64, ABM, BMI1, BMI2, ADX,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA3, AVX2,
                               AVX512-(F/CD/VL/BW/DQ)
Notes:                         Not enabled yet. (See the note above.)
----------------------------------------------------------------------
Binary:                        x64 AVX512-CD ~ ??
Target Processor(s):           Intel Knights Landing
Instruction Set Requirements:  x64, ABM, BMI1, BMI2, ADX,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA3, AVX2,
                               AVX512-(F/CD)
Notes:                         This one is more uncertain. Knights Landing
                               has such a drastically different architecture
                               from normal desktop processors that the AVX512
                               might be irrelevant.
----------------------------------------------------------------------
Binary:                        x64 ADX ~ Kurumi
Target Processor(s):           Intel Skylake
Instruction Set Requirements:  x64, ABM, BMI1, BMI2, ADX,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA3, AVX2
Notes:                         Will also run on Intel Broadwell processors.
----------------------------------------------------------------------
Binary:                        x64 AVX2 ~ Airi
Target Processor(s):           Intel Haswell
Instruction Set Requirements:  x64, ABM, BMI1, BMI2,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA3, AVX2
Notes:                         Not all Haswell processors support AVX.
----------------------------------------------------------------------
Binary:                        x64 XOP ~ Miyu
Target Processor(s):           AMD Bulldozer
Instruction Set Requirements:  x64, ABM,
                               SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                               AVX, FMA4, XOP
Notes:                         -
----------------------------------------------------------------------
Binary:                        x64 AVX ~ Hina
Target Processor(s):           Intel Sandy Bridge
Instruction Set Requirements:  x64, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX
Notes:                         Not all Sandy Bridge processors support AVX.
----------------------------------------------------------------------
Binary:                        x64 SSE4.1 ~ Ushio
Target Processor(s):           Intel Nehalem
Instruction Set Requirements:  x64, SSE, SSE2, SSE3, SSSE3, SSE4.1
Notes:                         -
----------------------------------------------------------------------
Binary:                        x64 SSE4.1 ~ Nagisa
Target Processor(s):           Intel Core 2 Penryn
Instruction Set Requirements:  x64, SSE, SSE2, SSE3, SSSE3, SSE4.1
Notes:                         Disabled since v0.6.5. But still maintained.
----------------------------------------------------------------------
Binary:                        x64 SSE3 ~ Kasumi
Target Processor(s):           -
Instruction Set Requirements:  x64, SSE, SSE2, SSE3
Notes:                         Originally targeted AMD K10. But years without
                               access to a suitable test machine have left the
                               tuning parameters uselessly out-of-date. The
                               binary is now tuned for Core 2 Kentsfield.
----------------------------------------------------------------------
Binary:                        x64
Target Processor(s):           Original AMD x64
Instruction Set Requirements:  x64, SSE, SSE2
Notes:                         Discontinued as of v0.4.1.
----------------------------------------------------------------------
Binary:                        x86 SSE3
Target Processor(s):           -
Instruction Set Requirements:  SSE, SSE2, SSE3
Notes:                         -
----------------------------------------------------------------------
Binary:                        x86
Target Processor(s):           -
Instruction Set Requirements:  none
Notes:                         Disabled in v0.6.1. Re-enabled in v0.7.1.*
----------------------------------------------------------------------

Each binary has two parts in its name. The first part (e.g. "x64 AVX") is the required instruction set architecture. The second part (e.g. "Hina") is the name of the computer that the binary was tuned on. Since there are a lot of computers, naming them is necessary to tell them apart on the network.

 

The names themselves are generally taken from various Japanese anime characters. Once a name is assigned, it sticks forever, even if the machine is replaced. For example, "Hina" was destroyed in an accident and was replaced with "Haruka", an Ivy Bridge laptop. But the old name remains.

 

These binaries reside in the "Binaries" folder and no attempt is made to hide them. Users are free and encouraged to run the binaries directly to see the effects of the various instruction sets. For everyone else who wants to keep things simple, the "y-cruncher" binary is a dispatcher that auto-selects the best binary to run.

 

When the user runs the dispatcher, it walks down the list of binaries in the order they are listed above and picks the first one that will run.
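The "walk down the list and pick the first one that runs" logic can be sketched like this. This is a hypothetical illustration using GCC/Clang's CPU feature builtins, not y-cruncher's actual dispatcher code, and the feature list is abbreviated (the AVX512 and ADX tiers are omitted):

```cpp
// Simplified sketch of binary dispatch by CPU feature detection.
// Real dispatch logic checks more features and launches a separate binary.
const char* pick_binary() {
    __builtin_cpu_init();                  // populate the CPU feature cache
    if (__builtin_cpu_supports("avx2"))   return "x64 AVX2 ~ Airi";
    if (__builtin_cpu_supports("xop"))    return "x64 XOP ~ Miyu";
    if (__builtin_cpu_supports("avx"))    return "x64 AVX ~ Hina";
    if (__builtin_cpu_supports("sse4.1")) return "x64 SSE4.1 ~ Ushio";
    if (__builtin_cpu_supports("sse3"))   return "x64 SSE3 ~ Kasumi";
    return "x86";                          // no-requirement fallback
}
```

Each check subsumes the ones below it, so the first match is the most capable binary the processor can run.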

The following exceptions are applied:

The dispatcher will also give warnings for the following situations:

 

*The x86 binary was disabled for the entire v0.6.x series for a number of technical reasons related to performance. y-cruncher was almost completely rewritten between v0.5.5 and v0.6.1. During that rewrite, a lot of old code was tossed out - including the original FFT that ran very fast on the x87 FPU. So starting from v0.6.1, the x86 binary used the newer vector-scalable FFT that was designed for SSE3 and AVX. While this vector FFT compiled and ran fine with x87 instructions, it was ultimately slower than the original FFT. In the end, the overall speed of the x86 binary regressed from v0.5.5 to v0.6.1. So the decision was made to simply disable it.

 

Fast forward to v0.7.1, and the cumulative effect of three years' worth of optimizations has finally overcome the v0.5.5 -> v0.6.1 regression. So the x86 binary has been re-enabled. However, the new x86 binary isn't the same as before. Like the rest of the binaries, it requires Windows Vista at a minimum, so don't expect to run it on any processors that are very old. Its main purpose is to let users compare the effects of the different instruction set extensions.

 

 

Instruction Set Extensions

 

As of v0.7.1, y-cruncher (explicitly) uses the following instruction sets:

By "explicitly", we mean done using either compiler intrinsics or inline assembly. The actual instruction set requirement will be broader since the compiler itself will often generate other instructions that are implied by targeting a specific processor. (y-cruncher doesn't explicitly use ABM or BMI1. But the compiler may generate them anyway when compiling for Haswell.)
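As a small illustration of what "explicit" use looks like (an assumed example, not y-cruncher code), the intrinsic below maps directly to the SSE2 PADDQ instruction, while the compiler remains free to emit whatever implied instructions it wants for the surrounding loads, stores, and addressing:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

// Add two pairs of 64-bit integers with one explicit SSE2 instruction.
void add2x64(const int64_t* a, const int64_t* b, int64_t* out) {
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    _mm_storeu_si128((__m128i*)out, _mm_add_epi64(va, vb));  // PADDQ
}
```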

 

 

Miscellaneous Instructions:

 

The 64-bit -> 128-bit multiply and the add-with-carry instructions are all extremely useful for large multiplication. Since no standard C/C++ construct will generate them, they must be generated using either compiler intrinsics or inline assembly.

 

Binaries that target Haswell or later will use the MULX instruction. Binaries that target Broadwell or later also use the ADCX and ADOX instructions.
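For illustration, here is roughly what these two primitives look like with intrinsics (GCC/Clang syntax; MSVC spells the multiply `_umul128`). This is a generic sketch, not y-cruncher's code. When compiling for the appropriate targets, the compiler can lower the multiply to MULX and the carry chain to ADC or ADCX/ADOX:

```cpp
#include <immintrin.h>   // _addcarry_u64

using u64 = unsigned long long;

// 64-bit x 64-bit -> 128-bit multiply: return the low word, write the high.
static inline u64 mul128(u64 a, u64 b, u64* hi) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (u64)(p >> 64);
    return (u64)p;
}

// Add two 3-limb integers with a carry chain - the pattern that the
// ADC/ADCX/ADOX instructions accelerate.
static void add3(const u64* a, const u64* b, u64* out) {
    unsigned char c = 0;
    c = _addcarry_u64(c, a[0], b[0], &out[0]);
    c = _addcarry_u64(c, a[1], b[1], &out[1]);
    (void)_addcarry_u64(c, a[2], b[2], &out[2]);
}
```

The ADCX/ADOX pair matters because it allows two independent carry chains to run in parallel, which a single ADC chain cannot do.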

 

 

Vector Instructions:

 

Arbitrary precision arithmetic has historically been difficult to vectorize due to carry-propagation. But at large enough sizes, it's an entirely different game since most of the larger algorithms are either directly vectorizable, or can be made vectorizable with the right hacks.

We won't try to enumerate all the places where y-cruncher explicitly vectorizes. But there are a lot of them, and they are all done manually with intrinsics and occasionally a bit of inline assembly. For the most part, everything performance-critical that can be vectorized has been vectorized manually. Very little is left to the compiler.

 

With the large number of target processors with varying sets of vector instructions, it may seem tedious to manually vectorize all of it. But there are plenty of tricks to abstract out most of the work without sacrificing performance.

 

In other words, y-cruncher does not have 10 different implementations for everything. One well-written implementation will suffice for all vector sizes from none (scalar) all the way to AVX512 and beyond. Furthermore, it will also handle all the various flavors of instructions within each vector size.
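One common way to achieve this (an illustrative sketch, not y-cruncher's actual abstraction layer) is to hide each vector width behind a tiny uniform wrapper type, so a single templated algorithm covers scalar and SIMD alike:

```cpp
#include <emmintrin.h>   // SSE2
#include <cstddef>

struct ScalarWord {                       // scalar fallback: width = 1 double
    static constexpr size_t width = 1;
    double v;
    static ScalarWord load(const double* p) { return { *p }; }
    void store(double* p) const { *p = v; }
    friend ScalarWord operator+(ScalarWord a, ScalarWord b) { return { a.v + b.v }; }
};

struct SseWord {                          // SSE2 version: width = 2 doubles
    static constexpr size_t width = 2;
    __m128d v;
    static SseWord load(const double* p) { return { _mm_loadu_pd(p) }; }
    void store(double* p) const { _mm_storeu_pd(p, v); }
    friend SseWord operator+(SseWord a, SseWord b) { return { _mm_add_pd(a.v, b.v) }; }
};

// One implementation, instantiated once per target.
// (For simplicity, n is assumed to be divisible by Word::width.)
template <class Word>
void add_arrays(const double* a, const double* b, double* out, size_t n) {
    for (size_t i = 0; i < n; i += Word::width)
        (Word::load(a + i) + Word::load(b + i)).store(out + i);
}
```

Adding an AVX or AVX512 target then only requires writing another small wrapper struct; the algorithms themselves don't change.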

 

Unfortunately, these "vector-scalable" approaches do not extend all the way to GPUs since those require a completely different programming paradigm.

 

 

Tuning Parameters

 

Tuning parameters cover everything else that isn't a processor-specific instruction. These are all things that affect performance, but do not affect the ability to run on other processors.

 

A partial list of these include:

While some parameters are tuned manually, most are done automatically using a classic superoptimizer.

 

For each given task, the superoptimizer benchmarks all available algorithms/configurations and puts the best one into a lookup table. This generates a table of "optimal" configurations. Due to the exhaustive nature of superoptimization, it also serves as a very thorough integration testing platform.
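The core loop of such a superoptimizer can be sketched as follows. This is a toy illustration of the idea (all names here are hypothetical; the real tuner benchmarks far more carefully and over many task sizes):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Kernel = void (*)(std::vector<double>&);

// Two interchangeable candidate implementations of the same task.
static void kernel_a(std::vector<double>& v) { for (auto& x : v) x += 1.0; }
static void kernel_b(std::vector<double>& v) {
    for (size_t i = 0; i < v.size(); i++) v[i] += 1.0;
}

// Benchmark every candidate on real data and return the index of the fastest.
// The caller records this index into the tuning lookup table.
size_t superoptimize(const std::vector<Kernel>& candidates, size_t size) {
    size_t best = 0;
    double best_time = 1e300;
    for (size_t k = 0; k < candidates.size(); k++) {
        std::vector<double> data(size, 0.0);
        auto t0 = std::chrono::steady_clock::now();
        candidates[k](data);
        auto t1 = std::chrono::steady_clock::now();
        double t = std::chrono::duration<double>(t1 - t0).count();
        if (t < best_time) { best_time = t; best = k; }
    }
    return best;
}
```

Because every candidate actually runs to completion during tuning, the process doubles as an exhaustive integration test of all the code paths.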

 

Since superoptimizer tuning is done with real benchmarks on an actual system, it takes everything into account - including things that are specific to the system rather than the processor line. As a result, the tuning fundamentally biases the results toward the specific system it is run on. For this reason, superoptimizer tuning is preferably done on properly configured high-end desktop systems with as much memory as possible.

 

This tuning can have a significant effect on performance. It is also the reason why the SSE4.1 binaries (which are tuned for Intel processors) run slower than the SSE3 binary on AMD Bulldozer processors.

 

y-cruncher has always had some form of superoptimization, but it wasn't taken seriously until v0.4.3. Back in those early days, it was often possible to squeeze out more than 10% performance over an untuned (or improperly tuned) configuration. Recent versions of y-cruncher don't gain nearly as much.

 

 

Why isn't the superoptimizer exposed to the user?

 

It's not that simple:

With rare exceptions, the superoptimizer produces nearly identical results for processors within the same generation. So it's sufficient to go "one size fits all" within each processor line. And of course, it keeps things a lot simpler as well.

 

 

 

Performance Matrix

 

All times in seconds.

 

 

Version v0.7.1:

y-cruncher v0.7.1.9465 - Pi - 250 million digits
(A dash marks binaries the processor cannot run.)

Binary:                                          x86       x86       x64       x64       x64       x64       x64       x64
                                                 -         SSE3      SSE3      SSE4.1    AVX       XOP       AVX2      ADX
Tuned on:                                        -         -         Kasumi    Ushio     Hina      Miyu      Airi      Kurumi

Core 2 Quad Q6600   (Intel Core, 2.4 GHz)        308.968   222.807   156.763   -         -         -         -         -
Core i7 920         (Intel Nehalem, 3.5 GHz)     168.149   113.313    83.244    80.000   -         -         -         -
Core i7 3630QM      (Intel Ivy Bridge, 3.2 GHz)  153.950    98.872    76.648    73.044    56.130   -         -         -
FX-8350             (AMD Piledriver, 4.0 GHz)    181.973   105.597    70.267    69.988    72.020    54.761   -         -
Core i7 4770K       (Intel Haswell, 4.0 GHz)     118.395    73.312    58.699    55.169    46.588   -          24.227   -
Core i7 6820HK      (Intel Skylake, 3.2 GHz)     138.380    83.017    67.255    62.939    48.250   -          25.059    24.877

y-cruncher v0.7.1.9465 - Pi - 1 billion digits (x64 binaries only)

Binary:                                          SSE3      SSE4.1    AVX       XOP       AVX2      ADX
Tuned on:                                        Kasumi    Ushio     Hina      Miyu      Airi      Kurumi

Core 2 Quad Q6600   (Intel Core, 2.4 GHz)        803.675   -         -         -         -         -
Core i7 920         (Intel Nehalem, 3.5 GHz)     424.546   406.073   -         -         -         -
Core i7 3630QM      (Intel Ivy Bridge, 3.2 GHz)  404.973   393.676   313.130   -         -         -
FX-8350             (AMD Piledriver, 4.0 GHz)    353.717   351.583   376.275   266.824   -         -
Core i7 4770K       (Intel Haswell, 4.0 GHz)     297.314   279.921   239.519   -         119.836   -
Core i7 6820HK      (Intel Skylake, 3.2 GHz)     338.069   319.290   248.265   -         122.021   121.586

 

 

Version v0.6.9:

y-cruncher v0.6.9.9462 - Pi - 250 million digits
(A dash marks binaries the processor cannot run.)

Binary:                                          x86       x86       x64       x64       x64       x64       x64       x64
                                                 -         SSE3      SSE3      SSE4.1    SSE4.1    AVX       XOP       AVX2
Tuned on:                                        -         -         Kasumi    Nagisa    Ushio     Hina      Miyu      Airi

Core 2 Quad Q6600   (Intel Core, 2.4 GHz)        420.987   242.887   174.021   -         -         -         -         -
2x Xeon X5482       (Intel Penryn, 3.2 GHz)      179.118   109.439    72.111    72.011    71.582   -         -         -
Core i7 920         (Intel Nehalem, 3.5 GHz)     177.332   116.777    88.065    84.550    85.637   -         -         -
Core i7 3630QM      (Intel Ivy Bridge, 3.2 GHz)  175.253   112.748    85.591    89.339    81.713    72.034   -         -
FX-8350             (AMD Piledriver, 4.0 GHz)    217.327   113.313    77.238    79.227    77.273   105.003    57.478   -
Core i7 4770K       (Intel Haswell, 4.0 GHz)     159.708    92.977    66.082    60.625    60.541    46.573   -          26.202
Core i7 6820HK      (Intel Skylake, 3.2 GHz)     165.881    97.104    69.067    68.068    68.136    50.341   -          26.248

y-cruncher v0.6.9.9462 - Pi - 1 billion digits (x64 binaries only)

Binary:                                          SSE3      SSE4.1    SSE4.1    AVX       XOP       AVX2
Tuned on:                                        Kasumi    Nagisa    Ushio     Hina      Miyu      Airi

Core 2 Quad Q6600   (Intel Core, 2.4 GHz)        880.448   -         -         -         -         -
2x Xeon X5482       (Intel Penryn, 3.2 GHz)      352.635   336.713   347.962   -         -         -
Core i7 920         (Intel Nehalem, 3.5 GHz)     436.454   412.397   424.928   -         -         -
Core i7 3630QM      (Intel Ivy Bridge, 3.2 GHz)  413.265   399.341   407.710   350.849   -         -
FX-8350             (AMD Piledriver, 4.0 GHz)    377.213   369.352   374.660   506.823   281.864   -
Core i7 4770K       (Intel Haswell, 4.0 GHz)     311.231   295.216   300.480   231.361   -         129.71
Core i7 6820HK      (Intel Skylake, 3.2 GHz)     343.253   335.229   342.091   253.056   -         130.53