Processor-Specific Optimizations

By Alexander Yee

 

(Last updated: August 3, 2019)

 

 


In terms of breaking size records, micro-optimizations and processor-specific optimizations are less important. That's because large computations are almost always bottlenecked by disk access. The 12.1 and 13.3 trillion digit computations of Pi both had a CPU utilization of less than 40% due to the disk bottleneck.

 

Nevertheless, one of the goals of y-cruncher is to be as fast as possible for the purposes of benchmarking and stress-testing.

 

y-cruncher does processor-specific optimizations in two ways:

  1. Use of processor-specific instruction set extensions.
  2. Tuning parameters that are optimized specifically for a particular environment.

Processor-specific instructions are the obvious one. Most of these are vector instructions (SSE/AVX), but there are others as well.

The second is processor-specific tuning. This is more subtle and involves things that are more architectural.

 

Targeted Processor Lines

 

As of v0.7.8, y-cruncher targets the following processor lines with specially optimized binaries:

| Binary | Target Processor(s) | Instruction Set Requirements | Tuned On | Notes |
|---|---|---|---|---|
| 18-CNL ~ Shinoa | Intel Cannon Lake | x64, ABM, BMI1, BMI2, ADX, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2, AVX512-(F/CD/VL/BW/DQ/IFMA/VBMI) | Core i3 8121U @ 2.3 - 3.2 GHz, 8 GB LPDDR4 @ 2400 MHz | Will likely be retuned for Ice Lake in the future. |
| 17-SKX ~ Kotori | Intel Skylake Purley | x64, ABM, BMI1, BMI2, ADX, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2, AVX512-(F/CD/VL/BW/DQ) | Core i9 7940X @ 4.7/4.0/3.7/2.8 GHz (non-AVX/AVX/AVX512/cache), 128 GB DDR4 @ 3466 MHz | |
| 17-ZN1 ~ Yukina | AMD Zen | x64, ABM, BMI1, BMI2, ADX, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2 | Ryzen 7 1800X @ 3.8 GHz, 64 GB DDR4 @ 2133 MHz | |
| 16-KNL | Intel Knights Landing | x64, ABM, BMI1, BMI2, ADX, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2, AVX512-(F/CD) | Untuned | Disabled since v0.7.8 due to EOL for Xeon Phi. |
| 14-BDW ~ Kurumi | Intel Broadwell | x64, ABM, BMI1, BMI2, ADX, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2 | Core i7 6820HK @ 3.2 GHz, 48 GB DDR4 @ 2133 MHz | |
| 13-HSW ~ Airi | Intel Haswell | x64, ABM, BMI1, BMI2, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2 | Core i7 5960X @ 4.0 GHz, 64 GB DDR4 @ 2133 MHz | |
| 11-BD1 ~ Miyu | AMD Bulldozer | x64, ABM, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA4, XOP | FX-8350 @ 4.0 GHz, 32 GB DDR3 @ 1333 MHz | Uses 128-bit AVX since the entire processor line cannot efficiently run 256-bit AVX. |
| 11-SNB ~ Hina | Intel Sandy Bridge | x64, SSE(1, 2, 3, S3, 4.1, 4.2), AVX | Core i7 3630QM @ 3.2 GHz, 8 GB DDR3 @ 1334 MHz | |
| 08-NHM ~ Ushio | Intel Nehalem | x64, SSE, SSE2, SSE3, SSSE3, SSE4.1 | Core i7 920 @ 3.5 GHz, 12 GB DDR3 @ 1333 MHz | SSE4.2 doesn't seem to help on Nehalem like it does on later processors. |
| 07-PNR ~ Nagisa | Intel Core 2 Penryn | x64, SSE, SSE2, SSE3, SSSE3, SSE4.1 | 2 x Xeon X5482 @ 3.2 GHz, 64 GB DDR2 @ 800 MHz | Disabled since v0.6.5. But still maintained. |
| 05-A64 ~ Kasumi | AMD Athlon 64, K8, K10 | x64, SSE, SSE2, SSE3 | Phenom II X4 @ 2.8 GHz, 12 GB DDR3 @ 1333 MHz | Legacy binaries. |
| 04-P4P | Intel Pentium 4 Prescott | SSE, SSE2, SSE3 | | |
| 00-x86 | - | none | | |

Each binary has two parts in its name:

These binaries reside in the "Binaries" folder and no attempt is made to hide them. Users are free and encouraged to run the binaries directly to see the effects of the various instruction sets. For everyone else who wants to keep things simple, the "y-cruncher" binary is a dispatcher that auto-selects the best binary to run.

 

 

Dispatch Algorithm:

 

When the user runs the dispatcher, it looks at the CPU vendor string to choose a list. Then it runs down the list in the following order until it finds one that can run.

| AMD Processor | Non-AMD Processor |
|---|---|
| 18-CNL ~ Shinoa | 18-CNL ~ Shinoa |
| 17-SKX ~ Kotori | 17-SKX ~ Kotori |
| 16-KNL | 16-KNL |
| 17-ZN1 ~ Yukina | 14-BDW ~ Kurumi |
| 11-BD1 ~ Miyu | 13-HSW ~ Airi |
| 08-NHM ~ Ushio | 11-SNB ~ Hina |
| 05-A64 ~ Kasumi | 08-NHM ~ Ushio |
| 04-P4P | 05-A64 ~ Kasumi |
| 00-x86 | 04-P4P |
| | 00-x86 |

The following restrictions also apply:

  1. Binaries that are not enabled are skipped.
  2. Core 2 processors with SSE4.1 will always pick "07-PNR ~ Nagisa" if it is available. (This binary is currently disabled in public builds.)
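
The list walk described above can be sketched in a few lines of C++. This is only an illustration, not y-cruncher's actual dispatcher: the `Binary` struct, `pick_binary()`, `dispatch_list()`, the feature-name strings, and the simplified requirement sets are all hypothetical, the lists are abbreviated, and the vendor check is reduced to a boolean. The Penryn special case is omitted.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical record for one binary in a dispatch list.
struct Binary {
    std::string name;
    std::vector<std::string> required;  // instruction sets the binary needs
    bool enabled;                       // binaries that are not enabled are skipped
};

// Returns the first enabled binary whose requirements the CPU meets,
// or an empty string if nothing in the list can run.
std::string pick_binary(const std::vector<Binary>& list,
                        const std::set<std::string>& cpu_features) {
    for (const Binary& b : list) {
        if (!b.enabled) continue;
        bool can_run = true;
        for (const std::string& isa : b.required) {
            if (cpu_features.count(isa) == 0) { can_run = false; break; }
        }
        if (can_run) return b.name;
    }
    return "";
}

// Abbreviated lists with simplified requirements, for illustration only.
std::vector<Binary> dispatch_list(bool is_amd) {
    std::vector<Binary> list = {
        {"17-SKX ~ Kotori", {"AVX512F", "AVX512DQ"}, true},
        {"16-KNL",          {"AVX512F", "AVX512CD"}, false},
    };
    if (is_amd) {
        list.push_back({"17-ZN1 ~ Yukina", {"AVX2", "ADX"}, true});
        list.push_back({"11-BD1 ~ Miyu",   {"AVX", "XOP"},  true});
    } else {
        list.push_back({"14-BDW ~ Kurumi", {"AVX2", "ADX"}, true});
        list.push_back({"13-HSW ~ Airi",   {"AVX2"},        true});
        list.push_back({"11-SNB ~ Hina",   {"AVX"},         true});
    }
    list.push_back({"08-NHM ~ Ushio",  {"SSE4.1"}, true});
    list.push_back({"05-A64 ~ Kasumi", {"SSE3"},   true});
    return list;
}
```

With these sample lists, a CPU reporting SSE3, SSE4.1, AVX, AVX2, and ADX resolves to "17-ZN1 ~ Yukina" on the AMD list and to "14-BDW ~ Kurumi" on the non-AMD list, which shows why the vendor decides the list before the feature check runs.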

 

The dispatcher will give warnings for the following situations:

 

 

Instruction Set Extensions

 

As of v0.7.8, y-cruncher (explicitly) uses the following instruction sets:

By "explicitly", we mean done using either compiler intrinsics or inline assembly. The actual instruction set requirement will be broader since the compiler itself will often generate other instructions that are implied by targeting a specific processor.

 

For example, y-cruncher doesn't use ABM instructions, but GCC generates them anyway when targeting Haswell or later. And while y-cruncher doesn't use BMI1 and AVX512-CD, the Intel Compiler generates them anyway when targeting Haswell and Knights Landing respectively.

 

 

Miscellaneous Instructions:

 

The 64-bit -> 128-bit multiply and add-with-carry instructions are all extremely useful for large multiplication. But since there's no standard C/C++ construct that will generate them, they must be generated either using compiler intrinsics or inline assembly.

 

Binaries that target Haswell or later will use the MULX instruction. Binaries that target Broadwell or later also use the ADCX and ADOX instructions. However, ADCX and ADOX are disabled on Knights Landing since they are slow there.
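
To make the role of the 64-bit -> 128-bit multiply and the carry chain concrete, here is a minimal sketch of a multiply-accumulate pass over 64-bit words. It is not y-cruncher's code. Instead of intrinsics or inline assembly it assumes the `unsigned __int128` extension available in GCC and Clang on 64-bit targets, and the function name `addmul_1` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Computes T[0..n) += A[0..n) * b and returns the carry-out word.
// Assumes a compiler that provides unsigned __int128 (GCC/Clang extension).
uint64_t addmul_1(uint64_t* T, const uint64_t* A, std::size_t n, uint64_t b) {
    uint64_t carry = 0;
    for (std::size_t i = 0; i < n; i++) {
        // 64 x 64 -> 128-bit product. Adding two more 64-bit values
        // cannot overflow 128 bits, so the sum below is exact.
        unsigned __int128 p = (unsigned __int128)A[i] * b;
        p += T[i];
        p += carry;
        T[i]  = (uint64_t)p;          // low word stays in place
        carry = (uint64_t)(p >> 64);  // high word carries into the next step
    }
    return carry;
}
```

Compilers typically lower the 128-bit product to a single widening multiply and the additions to add-with-carry, but the exact instruction selection is up to the compiler, which is the reason the text gives for using intrinsics or inline assembly instead.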

 

 

Vector Instructions:

 

Arbitrary precision arithmetic has historically been difficult to vectorize due to carry-propagation. But at large enough sizes, it's an entirely different game since most of the larger algorithms are either directly vectorizable, or can be made vectorizable with the right hacks.

We won't try to enumerate all the places where y-cruncher explicitly vectorizes. But there are a lot of them, and they are all done manually with intrinsics and occasionally a bit of inline assembly. For the most part, everything performance-critical that can be vectorized has been vectorized manually. Very little is left to the compiler.

 

With the large number of target processors with varying sets of vector instructions, it may seem tedious to manually vectorize all of it. But there are plenty of tricks to abstract out most of the work without sacrificing performance.

 

In other words, y-cruncher does not have 10 different implementations for everything. One well-written implementation will suffice for all vector sizes from none (scalar) all the way to AVX512 and beyond. Furthermore, it will also handle all the various flavors of instructions within each vector size.
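
One common way to get a single implementation across vector widths is to write the kernel as a template over a vector type. The sketch below illustrates that general idea only; it is not taken from y-cruncher. The `Vec<N>` wrapper and the `axpy` kernel are hypothetical, and `Vec<N>` is emulated with a plain array here, where a real build would presumably specialize it per instruction set (for example around `__m128d`, `__m256d`, or `__m512d`).

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-width vector of N doubles, emulated with scalar loops.
template <std::size_t N>
struct Vec {
    std::array<double, N> lane;

    static Vec load(const double* p) {
        Vec v;
        for (std::size_t i = 0; i < N; i++) v.lane[i] = p[i];
        return v;
    }
    static Vec broadcast(double x) {
        Vec v;
        for (std::size_t i = 0; i < N; i++) v.lane[i] = x;
        return v;
    }
    void store(double* p) const {
        for (std::size_t i = 0; i < N; i++) p[i] = lane[i];
    }
    friend Vec operator+(const Vec& a, const Vec& b) {
        Vec r;
        for (std::size_t i = 0; i < N; i++) r.lane[i] = a.lane[i] + b.lane[i];
        return r;
    }
    friend Vec operator*(const Vec& a, const Vec& b) {
        Vec r;
        for (std::size_t i = 0; i < N; i++) r.lane[i] = a.lane[i] * b.lane[i];
        return r;
    }
};

// One kernel, y[i] = a * x[i] + y[i], written once against Vec<N>.
// N = 1 is the scalar case; larger N maps to wider vector units.
template <std::size_t N>
void axpy(double a, const double* x, double* y, std::size_t len) {
    const Vec<N> va = Vec<N>::broadcast(a);
    std::size_t i = 0;
    for (; i + N <= len; i += N) {
        (va * Vec<N>::load(x + i) + Vec<N>::load(y + i)).store(y + i);
    }
    for (; i < len; i++) y[i] = a * x[i] + y[i];  // scalar tail
}
```

Calling `axpy<1>`, `axpy<4>`, and `axpy<8>` on the same input gives the same result, so the choice of width becomes a per-binary build decision rather than a separate implementation.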

 

Unfortunately, these "vector-scalable" approaches do not extend all the way to GPUs since those require a completely different programming paradigm.

 

 

Tuning Parameters

 

Tuning parameters cover everything else that isn't a processor-specific instruction. These are all things that affect performance, but do not affect the ability to run on other processors.

 

A partial list of these includes:

While some parameters are tuned manually, most are done automatically using a classic superoptimizer.

 

For each given task, the superoptimizer benchmarks all available algorithms/configurations and puts the best one into a lookup table. This generates a table of "optimal" configurations. Due to the exhaustive nature of superoptimization, it also serves as a very thorough integration testing platform.
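
The benchmark-everything-and-record-the-winner loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual superoptimizer: `Candidate`, `build_table()`, and `time_seconds()` are hypothetical names, the "task" is reduced to a single problem size, and each candidate supplies its own cost function so that the selection logic can be exercised with synthetic costs as well as real timings.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Hypothetical candidate: a name plus a function that reports the cost
// (for example, measured seconds) of running it at a given problem size.
struct Candidate {
    std::string name;
    std::function<double(std::size_t)> cost;
};

// Wall-clock time of one call, for building a cost function from real code.
double time_seconds(const std::function<void()>& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// For every size, try every candidate and record the cheapest one.
std::map<std::size_t, std::string> build_table(
        const std::vector<std::size_t>& sizes,
        const std::vector<Candidate>& candidates) {
    std::map<std::size_t, std::string> table;
    for (std::size_t n : sizes) {
        double best = std::numeric_limits<double>::infinity();
        for (const Candidate& c : candidates) {
            double t = c.cost(n);
            if (t < best) {
                best = t;
                table[n] = c.name;
            }
        }
    }
    return table;
}
```

Because every candidate is actually run for every size, a crash or wrong result in any configuration would surface during tuning, which matches the point above about the process doubling as integration testing.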

 

Since superoptimizer tuning is done with real benchmarks on an actual system, it takes into account everything, including things that are specific to the system rather than the processor line. As a result, the tuning is fundamentally biased toward the specific system it is run on. For the most part, superoptimizer tuning is preferably done on properly configured high-end desktop systems with as much memory as possible.

 

This tuning can have a significant effect on performance, and it is the reason why the SSE4.1 binaries (which are tuned for Intel processors) run slower than the SSE3 binary on AMD Bulldozer processors.

 

y-cruncher has always had some form of superoptimization, but it wasn't taken seriously until v0.4.3. Back in those early days of y-cruncher, it was often possible to squeeze out more than 10% performance over an untuned (or improperly tuned) configuration. Recent versions of y-cruncher don't gain nearly as much.

 

 

Why isn't the superoptimizer exposed to the user?

 

It's not that simple:

With rare exceptions, the superoptimizer produces nearly identical results for processors within the same generation. So it's sufficient to go "one size fits all" within each processor line. And of course, it keeps things a lot simpler as well.

 

 

 

Performance Matrix

 

All times in seconds.

 

Version v0.7.6:

 

y-cruncher v0.7.6.9483 x64
Pi - 1 billion digits

| Processor | Generation | Cores/Threads | CPU Frequency | Kasumi (SSE3) | Ushio (SSE4.1) | Hina (AVX) | Miyu (XOP) | Airi (AVX2) | Kurumi (ADX) | Yukina (ADX) | (AVX512-CD) | Kotori (AVX512-DQ) | (AVX512-VBMI) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Phenom II X4 | AMD K10 | 4 / 4 | 2.8 GHz | 648.422 | | | | | | | | | |
| Core i7 920 | Intel Nehalem | 4 / 8 | 3.5 GHz | | | | | | | | | | |
| Core i7 3630QM | Intel Ivy Bridge | 4 / 8 | 3.2 GHz | 401.458 | 357.863 | 314.457 | | | | | | | |
| FX-8350 | AMD Piledriver | 8 / 8 | 4.0 GHz | 337.487 | 316.194 | 375.843 | 219.344 | | | | | | |
| Core i7 4770K | Intel Haswell | 4 / 8 | 4.0 GHz | 285.843 | 263.904 | 233.302 | | 102.451 | | | | | |
| Core i7 6820HK | Intel Skylake | 4 / 8 | 3.2 GHz | 325.628 | 283.116 | 245.179 | | 108.533 | 108.108 | 122.804 | | | |
| Ryzen 7 1800X | AMD Zen | 8 / 16 | 3.7 GHz | 133.765 | 123.702 | 127.374 | 93.916 | 89.332 | 86.323 | 79.345 | | | |
| Core i9 7900X | Intel Skylake X | 10 / 20 | 4.3/4.0/3.6 GHz* | 104.689 | 92.045 | 81.732 | | 38.586 | 38.359 | 42.373 | 29.687 | 28.827 | |
| Core i3 8121U | Intel Cannon Lake | 2 / 4 | 1.8 - 2.2 GHz** | | | | | | | | | | |

*Skylake X processors run at different frequencies for different work-loads. These refer to the non-AVX/AVX/AVX512 frequencies.

**Variable frequency due to power throttling.

 

 

Version v0.7.3:

 

y-cruncher v0.7.3.9475 x64
Pi - 1 billion digits

| Processor | Generation | CPU Frequency | Kasumi (SSE3) | Ushio (SSE4.1) | Hina (AVX) | Miyu (XOP) | Airi (AVX2) | Kurumi (ADX) | Yukina (ADX) | (AVX512-CD) | Kotori (AVX512-DQ) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Phenom II X4 | AMD K10 | 2.8 GHz | | | | | | | | | |
| 2x Xeon X5482 | Intel Penryn | 3.2 GHz | | | | | | | | | |
| Core i7 920 | Intel Nehalem | 3.5 GHz | | | | | | | | | |
| Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 394.581 | 343.320 | 299.276 | | | | | | |
| FX-8350 | AMD Piledriver | 4.0 GHz | 350.312 | 348.505 | 402.221 | 245.184 | | | | | |
| Core i7 4770K | Intel Haswell | 4.0 GHz | 297.662 | 271.819 | 237.461 | | 111.004 | | | | |
| Core i7 6820HK | Intel Skylake | 3.2 GHz | 340.796 | 288.924 | 246.622 | | 117.228 | 116.774 | 132.139 | | |
| Ryzen 7 1800X | AMD Zen | 3.7 GHz | 151.941 | 142.441 | 143.236 | 107.959 | 104.811 | 102.943 | 99.072 | | |
| Core i9 7900X | Intel Skylake X | 4.5/4.0/3.8 GHz* | 104.604 | 90.433 | 80.518 | | 43.970 | 43.693 | 48.271 | 36.495 | 36.287 |

*Skylake Purley processors run at different frequencies for different work-loads. These refer to the non-AVX/AVX/AVX512 frequencies.

 

 

Version v0.7.2:

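y-cruncher v0.7.2.9467* (x86, x64)
Pi - 250 million digits

| Processor | Generation | CPU Frequency | x86 (-) | x86 (SSE3) | Kasumi (x64, SSE3) | Nagisa (x64, SSE4.1) | Ushio (x64, SSE4.1) | Hina (x64, AVX) | Miyu (x64, XOP) | Airi (x64, AVX2) | Kurumi (x64, ADX) | Yukina (x64, ADX) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 308.054 | 226.110 | 157.537 | | | | | | | |
| 2x Xeon X5482 | Intel Penryn | 3.2 GHz | 127.004 | 96.231 | 65.273 | 59.556 | 60.049 | | | | | |
| Core i7 920 | Intel Nehalem | 3.5 GHz | 168.209 | 113.770 | 83.253 | 74.924 | 74.230 | | | | | |
| Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 153.591 | 97.507 | 77.977 | 65.857 | 64.575 | 55.089 | | | | |
| FX-8350 | AMD Piledriver | 4.0 GHz | 179.998 | 101.482 | 70.333 | 65.984 | 70.011 | 79.185 | 50.004 | | | |
| Core i7 4770K | Intel Haswell | 4.0 GHz | 119.946 | 73.454 | 58.086 | 53.083 | 52.502 | 45.654 | | 22.695 | | |
| Core i7 6820HK | Intel Skylake | 3.2 GHz | 138.649 | 81.961 | 66.701 | 55.539 | 55.298 | 47.184 | | 23.715 | 23.557 | |
| Ryzen 7 1800X | AMD Zen | 3.7 GHz | | | 29.709 | 27.519 | 27.027 | 26.755 | 21.760 | 20.824 | 20.618 | 20.600 |
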
y-cruncher v0.7.2.9467* x64
Pi - 1 billion digits

| Processor | Generation | CPU Frequency | Kasumi (SSE3) | Nagisa (SSE4.1) | Ushio (SSE4.1) | Hina (AVX) | Miyu (XOP) | Airi (AVX2) | Kurumi (ADX) | Yukina (ADX) |
|---|---|---|---|---|---|---|---|---|---|---|
| Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 801.731 | | | | | | | |
| 2x Xeon X5482 | Intel Penryn | 3.2 GHz | 328.268 | 303.304 | 308.679 | | | | | |
| Core i7 920 | Intel Nehalem | 3.5 GHz | 422.875 | 386.210 | 381.903 | | | | | |
| Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 410.495 | 348.969 | 348.743 | 299.217 | | | | |
| FX-8350 | AMD Piledriver | 4.0 GHz | 352.722 | 332.738 | 346.144 | 406.305 | 249.087 | | | |
| Core i7 4770K | Intel Haswell | 4.0 GHz | 293.014 | 272.962 | 268.388 | 234.167 | | 111.593 | | |
| Core i7 6820HK | Intel Skylake | 3.2 GHz | 334.388 | 288.032 | 285.588 | 243.899 | | 116.398 | 115.661 | 131.439 |
| Ryzen 7 1800X | AMD Zen | 3.7 GHz | 149.560 | 139.655 | 138.219 | 140.686 | 105.836 | 102.008 | 100.108 | 95.852 |

*Version 0.7.2.9467 is not entirely speed consistent with the current release of v0.7.2.9468 due to a number of post-feature-freeze changes. But the performance fluctuations should be negligible enough that these benchmarks are representative of v0.7.2.9468.

 

 

Version v0.7.1:

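y-cruncher v0.7.1.9465 (x86, x64)
Pi - 250 million digits

| Processor | Generation | CPU Frequency | x86 (-) | x86 (SSE3) | Kasumi (x64, SSE3) | Ushio (x64, SSE4.1) | Hina (x64, AVX) | Miyu (x64, XOP) | Airi (x64, AVX2) | Kurumi (x64, ADX) |
|---|---|---|---|---|---|---|---|---|---|---|
| Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 308.968 | 222.807 | 156.763 | | | | | |
| Core i7 920 | Intel Nehalem | 3.5 GHz | 168.149 | 113.313 | 83.244 | 80.000 | | | | |
| Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 153.950 | 98.872 | 76.648 | 73.044 | 56.130 | | | |
| FX-8350 | AMD Piledriver | 4.0 GHz | 181.973 | 105.597 | 70.267 | 69.988 | 72.020 | 54.761 | | |
| Core i7 4770K | Intel Haswell | 4.0 GHz | 118.395 | 73.312 | 58.699 | 55.169 | 46.588 | | 24.227 | |
| Core i7 6820HK | Intel Skylake | 3.2 GHz | 138.380 | 83.017 | 67.255 | 62.939 | 48.250 | | 25.059 | 24.877 |
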
y-cruncher v0.7.1.9465 x64
Pi - 1 billion digits

| Processor | Generation | CPU Frequency | Kasumi (SSE3) | Ushio (SSE4.1) | Hina (AVX) | Miyu (XOP) | Airi (AVX2) | Kurumi (ADX) |
|---|---|---|---|---|---|---|---|---|
| Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 803.675 | | | | | |
| Core i7 920 | Intel Nehalem | 3.5 GHz | 424.546 | 406.073 | | | | |
| Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 404.973 | 393.676 | 313.130 | | | |
| FX-8350 | AMD Piledriver | 4.0 GHz | 353.717 | 351.583 | 376.275 | 266.824 | | |
| Core i7 4770K | Intel Haswell | 4.0 GHz | 297.314 | 279.921 | 239.519 | | 119.836 | |
| Core i7 6820HK | Intel Skylake | 3.2 GHz | 338.069 | 319.290 | 248.265 | | 122.021 | 121.586 |

 

 

Version v0.6.9:

y-cruncher v0.6.9.9462 (x86, x64)
Pi - 250 million digits

| Processor | Generation | CPU Frequency | x86 (-) | x86 (SSE3) | Kasumi (x64, SSE3) | Nagisa (x64, SSE4.1) | Ushio (x64, SSE4.1) | Hina (x64, AVX) | Miyu (x64, XOP) | Airi (x64, AVX2) |
|---|---|---|---|---|---|---|---|---|---|---|
| Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 420.987 | 242.887 | 174.021 | | | | | |
| 2x Xeon X5482 | Intel Penryn | 3.2 GHz | 179.118 | 109.439 | 72.111 | 72.011 | 71.582 | | | |
| Core i7 920 | Intel Nehalem | 3.5 GHz | 177.332 | 116.777 | 88.065 | 84.550 | 85.637 | | | |
| Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 175.253 | 112.748 | 85.591 | 89.339 | 81.713 | 72.034 | | |
| FX-8350 | AMD Piledriver | 4.0 GHz | 217.327 | 113.313 | 77.238 | 79.227 | 77.273 | 105.003 | 57.478 | |
| Core i7 4770K | Intel Haswell | 4.0 GHz | 159.708 | 92.977 | 66.082 | 60.625 | 60.541 | 46.573 | | 26.202 |
| Core i7 6820HK | Intel Skylake | 3.2 GHz | 165.881 | 97.104 | 69.067 | 68.068 | 68.136 | 50.341 | | 26.248 |

y-cruncher v0.6.9.9462 x64
Pi - 1 billion digits

| Processor | Generation | CPU Frequency | Kasumi (SSE3) | Nagisa (SSE4.1) | Ushio (SSE4.1) | Hina (AVX) | Miyu (XOP) | Airi (AVX2) |
|---|---|---|---|---|---|---|---|---|
| Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 880.448 | | | | | |
| 2x Xeon X5482 | Intel Penryn | 3.2 GHz | 352.635 | 336.713 | 347.962 | | | |
| Core i7 920 | Intel Nehalem | 3.5 GHz | 436.454 | 412.397 | 424.928 | | | |
| Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 413.265 | 399.341 | 407.710 | 350.849 | | |
| FX-8350 | AMD Piledriver | 4.0 GHz | 377.213 | 369.352 | 374.660 | 506.823 | 281.864 | |
| Core i7 4770K | Intel Haswell | 4.0 GHz | 311.231 | 295.216 | 300.480 | 231.361 | | 129.71 |
| Core i7 6820HK | Intel Skylake | 3.2 GHz | 343.253 | 335.229 | 342.091 | 253.056 | | 130.53 |

Note that the tuning parameters in v0.6.9 were very out-of-date. That's why the "Nagisa" and "Ushio" binaries don't always perform better on their target processors.