(Last updated: August 3, 2019)
In terms of breaking size records, micro-optimizations and processor-specific optimizations are less important. That's because large computations are almost always bottlenecked by disk access. The 12.1 and 13.3 trillion digit computations of Pi both had a CPU utilization of less than 40% due to the disk bottleneck.
Nevertheless, one of the goals of y-cruncher is to be as fast as possible for the purposes of benchmarking and stress-testing.
y-cruncher does processor-specific optimizations in two ways:
Processor-specific instructions are the obvious one. Most of these involve vector instructions (SSE/AVX), but there are others as well.
The second is processor-specific tuning. This is more subtle and involves things that are more architectural.
As of v0.7.8, y-cruncher targets the following processor lines with specially optimized binaries:
Binary | Target Processor(s) | Instruction Set Requirements | Tuned On | Notes |
18-CNL ~ Shinoa | Intel Cannon Lake | x64, ABM, BMI1, BMI2, ADX, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2, AVX512-(F/CD/VL/BW/DQ/IFMA/VBMI) | Core i3 8121U @ 2.3 - 3.2 GHz, 8 GB LPDDR4 @ 2400 MHz | Will likely be retuned for Ice Lake in the future. |
17-SKX ~ Kotori | Intel Skylake Purley | x64, ABM, BMI1, BMI2, ADX, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2, AVX512-(F/CD/VL/BW/DQ) | Core i9 7940X @ 4.7/4.0/3.7/2.8 GHz (non-AVX/AVX/AVX512/cache), 128 GB DDR4 @ 3466 MHz | |
17-ZN1 ~ Yukina | AMD Zen | x64, ABM, BMI1, BMI2, ADX, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2 | Ryzen 7 1800X @ 3.8 GHz, 64 GB DDR4 @ 2133 MHz | |
16-KNL | Intel Knights Landing | x64, ABM, BMI1, BMI2, ADX, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2, AVX512-(F/CD) | Untuned | Disabled since v0.7.8 due to EOL for Xeon Phi. |
14-BDW ~ Kurumi | Intel Broadwell | x64, ABM, BMI1, BMI2, ADX, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2 | Core i7 6820HK @ 3.2 GHz, 48 GB DDR4 @ 2133 MHz | |
13-HSW ~ Airi | Intel Haswell | x64, ABM, BMI1, BMI2, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA3, AVX2 | Core i7 5960X @ 4.0 GHz, 64 GB DDR4 @ 2133 MHz | |
11-BD1 ~ Miyu | AMD Bulldozer | x64, ABM, SSE(1, 2, 3, S3, 4.1, 4.2), AVX, FMA4, XOP | FX-8350 @ 4.0 GHz, 32 GB DDR3 @ 1333 MHz | Uses 128-bit AVX since the entire processor line cannot efficiently run 256-bit AVX. |
11-SNB ~ Hina | Intel Sandy Bridge | x64, SSE(1, 2, 3, S3, 4.1, 4.2), AVX | Core i7 3630QM @ 3.2 GHz, 8 GB DDR3 @ 1334 MHz | |
08-NHM ~ Ushio | Intel Nehalem | x64, SSE, SSE2, SSE3, SSSE3, SSE4.1 | Core i7 920 @ 3.5 GHz, 12 GB DDR3 @ 1333 MHz | SSE4.2 doesn't seem to help on Nehalem like it does on later processors. |
07-PNR ~ Nagisa | Intel Core 2 Penryn | x64, SSE, SSE2, SSE3, SSSE3, SSE4.1 | 2 x Xeon X5482 @ 3.2 GHz, 64 GB DDR2 @ 800 MHz | Disabled since v0.6.5. But still maintained. |
05-A64 ~ Kasumi | AMD Athlon 64, K8, K10 | x64, SSE, SSE2, SSE3 | Phenom II X4 @ 2.8 GHz, 12 GB DDR3 @ 1333 MHz | Legacy binaries. |
04-P4P | Intel Pentium 4 Prescott | SSE, SSE2, SSE3 | | |
00-x86 | - | none | | |
Each binary has two parts in its name:
These binaries reside in the "Binaries" folder and no attempt is made to hide them. Users are free and encouraged to run the binaries directly to see the effects of the various instruction sets. For everyone else who wants to keep things simple, the "y-cruncher" binary is a dispatcher that auto-selects the best binary to run.
Dispatch Algorithm:
When the user runs the dispatcher, it looks at the CPU vendor string to choose a list. Then it runs down the list in the following order until it finds one that can run.
AMD Processor | Non-AMD Processor |
18-CNL ~ Shinoa | 18-CNL ~ Shinoa |
17-SKX ~ Kotori | 17-SKX ~ Kotori |
16-KNL | 16-KNL |
17-ZN1 ~ Yukina | 14-BDW ~ Kurumi |
11-BD1 ~ Miyu | 13-HSW ~ Airi |
08-NHM ~ Ushio | 11-SNB ~ Hina |
05-A64 ~ Kasumi | 08-NHM ~ Ushio |
04-P4P | 05-A64 ~ Kasumi |
00-x86 | 04-P4P |
 | 00-x86 |
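The selection rule described above can be sketched in a few lines. This is only an illustration, not y-cruncher's actual dispatcher: `select_binary` and the `REQ_*` sets are hypothetical names, the feature sets are transcribed from the "Instruction Set Requirements" column of the binaries table, and the additional restrictions and warnings are not modeled. It assumes the caller has already obtained the CPUID vendor string and the set of supported features.

```python
# Hypothetical feature sets, built up from the requirements column of the binaries table.
REQ_SSE3 = {"SSE", "SSE2", "SSE3"}
REQ_SSE41 = REQ_SSE3 | {"x64", "SSSE3", "SSE4.1"}
REQ_AVX = REQ_SSE41 | {"SSE4.2", "AVX"}
REQ_AVX2 = REQ_AVX | {"ABM", "BMI1", "BMI2", "FMA3", "AVX2"}
REQ_ADX = REQ_AVX2 | {"ADX"}
REQ_KNL = REQ_ADX | {"AVX512-F", "AVX512-CD"}
REQ_SKX = REQ_KNL | {"AVX512-VL", "AVX512-BW", "AVX512-DQ"}
REQ_CNL = REQ_SKX | {"AVX512-IFMA", "AVX512-VBMI"}

REQUIREMENTS = {
    "18-CNL": REQ_CNL,
    "17-SKX": REQ_SKX,
    "16-KNL": REQ_KNL,
    "17-ZN1": REQ_ADX,
    "14-BDW": REQ_ADX,
    "13-HSW": REQ_AVX2,
    "11-BD1": REQ_AVX | {"ABM", "FMA4", "XOP"},
    "11-SNB": REQ_AVX,
    "08-NHM": REQ_SSE41,
    "05-A64": REQ_SSE3 | {"x64"},
    "04-P4P": REQ_SSE3,
    "00-x86": set(),
}

# Priority lists, in the order given in the dispatch table.
AMD_ORDER = ["18-CNL", "17-SKX", "16-KNL", "17-ZN1", "11-BD1",
             "08-NHM", "05-A64", "04-P4P", "00-x86"]
NON_AMD_ORDER = ["18-CNL", "17-SKX", "16-KNL", "14-BDW", "13-HSW",
                 "11-SNB", "08-NHM", "05-A64", "04-P4P", "00-x86"]


def select_binary(vendor, features):
    """Return the first binary in the vendor's list whose requirements are all met."""
    order = AMD_ORDER if vendor == "AuthenticAMD" else NON_AMD_ORDER
    for name in order:
        if REQUIREMENTS[name] <= features:  # subset test: every required feature is present
            return name
    raise RuntimeError("unreachable: 00-x86 has no requirements")
```

With these assumptions, a processor reporting exactly the `REQ_ADX` feature set is sent to 17-ZN1 when the vendor string is `AuthenticAMD` and to 14-BDW otherwise, which is the only place the two lists disagree for that feature level.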
The following restrictions also apply:
The dispatcher will give warnings for the following situations:
As of v0.7.8, y-cruncher (explicitly) uses the following instruction sets:
By "explicitly", we mean done using either compiler intrinsics or inline assembly. The actual instruction set requirement will be broader since the compiler itself will often generate other instructions that are implied by targeting a specific processor.
For example, y-cruncher doesn't use ABM instructions, but GCC generates them anyway when targeting Haswell or later. And while y-cruncher doesn't use BMI1 and AVX512-CD, the Intel Compiler generates them anyway when targeting Haswell and Knights Landing respectively.
Miscellaneous Instructions:
The 64-bit -> 128-bit multiply and add-with-carry instructions are all extremely useful for large multiplication. But since there's no standard C/C++ construct that will generate them, they must be generated either using compiler intrinsics or inline assembly.
Binaries that target Haswell or later will use the MULX instruction. Binaries that target Broadwell or later also use the ADCX and ADOX instructions. However, ADCX and ADOX are both disabled on Knights Landing since they are slow there.
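To see why these two primitives matter, here is a minimal sketch of multiplying a multi-word integer by a single 64-bit word. It is written in Python purely for illustration: `mul_wide` and `add_carry` are hypothetical helpers that emulate what a 64-bit -> 128-bit multiply and an add-with-carry instruction provide in hardware, and none of this is y-cruncher's code.

```python
MASK = (1 << 64) - 1  # one 64-bit limb


def mul_wide(x, y):
    """Emulate a 64 x 64 -> 128-bit multiply; returns (low, high) 64-bit halves."""
    p = x * y
    return p & MASK, p >> 64


def add_carry(x, y, carry_in):
    """Emulate add-with-carry; returns (sum mod 2^64, carry_out)."""
    s = x + y + carry_in
    return s & MASK, s >> 64


def mul_by_word(a, w):
    """Multiply a little-endian array of 64-bit limbs by one 64-bit word."""
    out = []
    high_prev = 0  # high half of the previous partial product
    carry = 0      # carry bit chained from limb to limb
    for limb in a:
        low, high = mul_wide(limb, w)
        s, carry = add_carry(low, high_prev, carry)
        out.append(s)
        high_prev = high
    s, carry = add_carry(high_prev, 0, carry)
    out.append(s)  # the product always fits in len(a) + 1 limbs
    return out


def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian 64-bit limbs."""
    return [(n >> (64 * i)) & MASK for i in range(count)]


def from_limbs(limbs):
    """Reassemble little-endian 64-bit limbs into an integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

Every iteration of the inner loop needs both halves of a full-width product and a carry bit passed to the next limb. Standard C/C++ has no portable way to express either, which is why intrinsics or inline assembly are needed to get them.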
Vector Instructions:
Arbitrary precision arithmetic has historically been difficult to vectorize due to carry-propagation. But at large enough sizes, it's an entirely different game since most of the larger algorithms are either directly vectorizable, or can be made vectorizable with the right hacks.
We won't try to enumerate all the places where y-cruncher explicitly vectorizes. But there are a lot of them, and they are all done manually with intrinsics and occasionally a bit of inline assembly. For the most part, everything performance-critical that can be vectorized has been vectorized manually. Very little is left to the compiler.
With the large number of target processors with varying sets of vector instructions, it may seem tedious to manually vectorize all of it. But there are plenty of tricks to abstract out most of the work without sacrificing performance.
In other words, y-cruncher does not have 10 different implementations for everything. One well-written implementation will suffice for all vector sizes from none (scalar) all the way to AVX512 and beyond. Furthermore, it will also handle all the various flavors of instructions within each vector size.
Unfortunately, these "vector-scalable" approaches do not extend all the way to GPUs since those require a completely different programming paradigm.
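As a toy illustration of the "one implementation for all vector sizes" idea, the sketch below writes a kernel once against an abstract vector type and then runs it at several widths. Everything here is hypothetical: the `Lanes` class merely stands in for a SIMD register of doubles, and a real implementation would more likely use something like C++ templates over thin wrappers around intrinsics. It is not a description of how y-cruncher actually does it.

```python
class Lanes:
    """Toy stand-in for a SIMD register holding `width` doubles."""
    width = 1

    def __init__(self, values):
        assert len(values) == self.width
        self.values = list(values)

    @classmethod
    def load(cls, data, i):
        """Load `width` consecutive elements starting at index i."""
        return cls(data[i:i + cls.width])

    def store(self, data, i):
        """Store all lanes back to data starting at index i."""
        data[i:i + self.width] = self.values

    def __add__(self, other):
        return type(self)([x + y for x, y in zip(self.values, other.values)])

    def __mul__(self, other):
        return type(self)([x * y for x, y in zip(self.values, other.values)])


class Scalar(Lanes):
    width = 1  # no vector instructions


class Sse(Lanes):
    width = 2  # 128-bit register, 2 doubles


class Avx(Lanes):
    width = 4  # 256-bit register, 4 doubles


class Avx512(Lanes):
    width = 8  # 512-bit register, 8 doubles


def multiply_add(vec, a, b, c):
    """Compute out[i] = a[i] * b[i] + c[i]; written once for every vector type."""
    assert len(a) % vec.width == 0
    out = [0.0] * len(a)
    for i in range(0, len(a), vec.width):
        r = vec.load(a, i) * vec.load(b, i) + vec.load(c, i)
        r.store(out, i)
    return out
```

The kernel `multiply_add` never mentions a width. Supporting a new vector size only requires a new vector type, not a new kernel.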
Tuning parameters cover everything else that isn't a processor-specific instruction. These are all things that affect performance, but do not affect the ability to run on other processors.
A partial list of these includes:
While some parameters are tuned manually, most are done automatically using a classic superoptimizer.
For each given task, the superoptimizer benchmarks all available algorithms/configurations and puts the best one into a lookup table. This generates a table of "optimal" configurations. Due to the exhaustive nature of superoptimization, it also serves as a very thorough integration testing platform.
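The table-building step can be sketched as follows. This is a simplified illustration under stated assumptions, not y-cruncher's tuner: `build_table`, `measure_seconds`, and the shape of `candidates` (a mapping from a name to a callable that performs the task) are all hypothetical.

```python
import time


def measure_seconds(func, task, repeats=3):
    """Run func(task) a few times and return the best wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(task)
        best = min(best, time.perf_counter() - start)
    return best


def build_table(tasks, candidates, measure=measure_seconds):
    """For every task, benchmark every candidate and record the fastest one.

    tasks      -- iterable of hashable task descriptions (for example, sizes)
    candidates -- dict mapping a configuration name to a callable taking a task
    measure    -- function (callable, task) -> cost; lower is better
    """
    table = {}
    for task in tasks:
        timings = {name: measure(func, task) for name, func in candidates.items()}
        table[task] = min(timings, key=timings.get)
    return table
```

Because every candidate is actually executed for every task, a loop like this can also be used to cross-check the candidates' results against each other, which is one way an exhaustive tuner can double as an integration test. The `measure` parameter is injectable so that the selection logic can be exercised with a deterministic cost model instead of real timings.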
Since superoptimizer tuning is done with real benchmarks on an actual system, it takes into account everything, including things that are specific to the system rather than the processor line. As a result, the tuning is fundamentally biased toward the specific system it is run on. For the most part, superoptimizer tuning is preferably done on properly configured high-end desktop systems with as much memory as possible.
This tuning can have a significant effect on performance. It is the reason why the SSE4.1 binaries (which are tuned for Intel processors) run slower than the SSE3 binary on AMD Bulldozer processors.
y-cruncher has always had some form of superoptimization, but it wasn't taken seriously until v0.4.3. Back in those early days of y-cruncher, it was often possible to squeeze out more than 10% performance over an untuned (or improperly tuned) configuration. Recent versions of y-cruncher don't gain nearly as much.
Why isn't the superoptimizer exposed to the user?
It's not that simple:
With rare exceptions, the superoptimizer produces nearly identical results for processors within the same generation. So it's sufficient to go "one size fits all" within each processor line. And of course, it keeps things a lot simpler as well.
All times in seconds.
Version v0.7.6:
y-cruncher v0.7.6.9483 | x64 | ||||||||||||
Pi - 1 billion digits | SSE3 | SSE4.1 | AVX | XOP | AVX2 | ADX | AVX512-CD | AVX512-DQ | AVX512-VBMI |
Processor | Generation | Cores/Threads | CPU Frequency | Kasumi | Ushio | Hina | Miyu | Airi | Kurumi | Yukina | Kotori | ||
Phenom II X4 | AMD K10 | 4 / 4 | 2.8 GHz | 648.422 | |||||||||
Core i7 920 | Intel Nehalem | 4 / 8 | 3.5 GHz | ||||||||||
Core i7 3630QM | Intel Ivy Bridge | 4 / 8 | 3.2 GHz | 401.458 | 357.863 | 314.457 | |||||||
FX-8350 | AMD Piledriver | 8 / 8 | 4.0 GHz | 337.487 | 316.194 | 375.843 | 219.344 | ||||||
Core i7 4770K | Intel Haswell | 4 / 8 | 4.0 GHz | 285.843 | 263.904 | 233.302 | 102.451 | ||||||
Core i7 6820HK | Intel Skylake | 4 / 8 | 3.2 GHz | 325.628 | 283.116 | 245.179 | 108.533 | 108.108 | 122.804 | ||||
Ryzen 7 1800X | AMD Zen | 8 / 16 | 3.7 GHz | 133.765 | 123.702 | 127.374 | 93.916 | 89.332 | 86.323 | 79.345 | |||
Core i9 7900X | Intel Skylake X | 10 / 20 | 4.3/4.0/3.6 GHz* | 104.689 | 92.045 | 81.732 | 38.586 | 38.359 | 42.373 | 29.687 | 28.827 | ||
Core i3 8121U | Intel Cannon Lake | 2 / 4 | 1.8 - 2.2 GHz** |
*Skylake X processors run at different frequencies for different work-loads. These refer to the non-AVX/AVX/AVX512 frequencies.
**Variable frequency due to power throttling.
Version v0.7.3:
y-cruncher v0.7.3.9475 | x64 | ||||||||||
Pi - 1 billion digits | SSE3 | SSE4.1 | AVX | XOP | AVX2 | ADX | AVX512-CD | AVX512-DQ | |||
Processor | Kasumi | Ushio | Hina | Miyu | Airi | Kurumi | Yukina | Kotori | |||
Phenom II X4 | AMD K10 | 2.8 GHz | |||||||||
2x Xeon X5482 | Intel Penryn | 3.2 GHz | |||||||||
Core i7 920 | Intel Nehalem | 3.5 GHz | |||||||||
Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 394.581 | 343.320 | 299.276 | ||||||
FX-8350 | AMD Piledriver | 4.0 GHz | 350.312 | 348.505 | 402.221 | 245.184 | |||||
Core i7 4770K | Intel Haswell | 4.0 GHz | 297.662 | 271.819 | 237.461 | 111.004 | |||||
Core i7 6820HK | Intel Skylake | 3.2 GHz | 340.796 | 288.924 | 246.622 | 117.228 | 116.774 | 132.139 | |||
Ryzen 7 1800X | AMD Zen | 3.7 GHz | 151.941 | 142.441 | 143.236 | 107.959 | 104.811 | 102.943 | 99.072 | ||
Core i9 7900X | Intel Skylake X | 4.5/4.0/3.8 GHz* | 104.604 | 90.433 | 80.518 | 43.970 | 43.693 | 48.271 | 36.495 | 36.287 |
*Skylake Purley processors run at different frequencies for different work-loads. These refer to the non-AVX/AVX/AVX512 frequencies.
Version v0.7.2:
y-cruncher v0.7.2.9467* | x86 | x64 | ||||||||||
Pi - 250 million digits | - | SSE3 | SSE3 | SSE4.1 | SSE4.1 | AVX | XOP | AVX2 | ADX | |||
Processor | Kasumi | Nagisa | Ushio | Hina | Miyu | Airi | Kurumi | Yukina | ||||
Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 308.054 | 226.110 | 157.537 | |||||||
2x Xeon X5482 | Intel Penryn | 3.2 GHz | 127.004 | 96.231 | 65.273 | 59.556 | 60.049 | |||||
Core i7 920 | Intel Nehalem | 3.5 GHz | 168.209 | 113.770 | 83.253 | 74.924 | 74.230 | |||||
Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 153.591 | 97.507 | 77.977 | 65.857 | 64.575 | 55.089 | ||||
FX-8350 | AMD Piledriver | 4.0 GHz | 179.998 | 101.482 | 70.333 | 65.984 | 70.011 | 79.185 | 50.004 | |||
Core i7 4770K | Intel Haswell | 4.0 GHz | 119.946 | 73.454 | 58.086 | 53.083 | 52.502 | 45.654 | 22.695 | |||
Core i7 6820HK | Intel Skylake | 3.2 GHz | 138.649 | 81.961 | 66.701 | 55.539 | 55.298 | 47.184 | 23.715 | 23.557 | ||
Ryzen 7 1800X | AMD Zen | 3.7 GHz | 29.709 | 27.519 | 27.027 | 26.755 | 21.760 | 20.824 | 20.618 | 20.600 |
y-cruncher v0.7.2.9467* | x64 | |||||||||
Pi - 1 billion digits | SSE3 | SSE4.1 | SSE4.1 | AVX | XOP | AVX2 | ADX | |||
Processor | Kasumi | Nagisa | Ushio | Hina | Miyu | Airi | Kurumi | Yukina | ||
Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 801.731 | |||||||
2x Xeon X5482 | Intel Penryn | 3.2 GHz | 328.268 | 303.304 | 308.679 | |||||
Core i7 920 | Intel Nehalem | 3.5 GHz | 422.875 | 386.210 | 381.903 | |||||
Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 410.495 | 348.969 | 348.743 | 299.217 | ||||
FX-8350 | AMD Piledriver | 4.0 GHz | 352.722 | 332.738 | 346.144 | 406.305 | 249.087 | |||
Core i7 4770K | Intel Haswell | 4.0 GHz | 293.014 | 272.962 | 268.388 | 234.167 | 111.593 | |||
Core i7 6820HK | Intel Skylake | 3.2 GHz | 334.388 | 288.032 | 285.588 | 243.899 | 116.398 | 115.661 | 131.439 | |
Ryzen 7 1800X | AMD Zen | 3.7 GHz | 149.560 | 139.655 | 138.219 | 140.686 | 105.836 | 102.008 | 100.108 | 95.852 |
*Version v0.7.2.9467 is not entirely speed-consistent with the current release, v0.7.2.9468, due to a number of post-feature-freeze changes. But the performance fluctuations should be small enough that these benchmarks are representative of v0.7.2.9468.
Version v0.7.1:
y-cruncher v0.7.1.9465 | x86 | x64 | ||||||||
Pi - 250 million digits | - | SSE3 | SSE3 | SSE4.1 | AVX | XOP | AVX2 | ADX | ||
Processor | Kasumi | Ushio | Hina | Miyu | Airi | Kurumi | ||||
Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 308.968 | 222.807 | 156.763 | |||||
Core i7 920 | Intel Nehalem | 3.5 GHz | 168.149 | 113.313 | 83.244 | 80.000 | ||||
Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 153.950 | 98.872 | 76.648 | 73.044 | 56.130 | |||
FX-8350 | AMD Piledriver | 4.0 GHz | 181.973 | 105.597 | 70.267 | 69.988 | 72.020 | 54.761 | ||
Core i7 4770K | Intel Haswell | 4.0 GHz | 118.395 | 73.312 | 58.699 | 55.169 | 46.588 | 24.227 | ||
Core i7 6820HK | Intel Skylake | 3.2 GHz | 138.380 | 83.017 | 67.255 | 62.939 | 48.250 | 25.059 | 24.877 |
y-cruncher v0.7.1.9465 | x64 | |||||||
Pi - 1 billion digits | SSE3 | SSE4.1 | AVX | XOP | AVX2 | ADX | ||
Processor | Kasumi | Ushio | Hina | Miyu | Airi | Kurumi | ||
Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 803.675 | |||||
Core i7 920 | Intel Nehalem | 3.5 GHz | 424.546 | 406.073 | ||||
Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 404.973 | 393.676 | 313.130 | |||
FX-8350 | AMD Piledriver | 4.0 GHz | 353.717 | 351.583 | 376.275 | 266.824 | ||
Core i7 4770K | Intel Haswell | 4.0 GHz | 297.314 | 279.921 | 239.519 | 119.836 | ||
Core i7 6820HK | Intel Skylake | 3.2 GHz | 338.069 | 319.290 | 248.265 | 122.021 | 121.586 |
Version v0.6.9:
y-cruncher v0.6.9.9462 | x86 | x64 | ||||||||
Pi - 250 million digits | - | SSE3 | SSE3 | SSE4.1 | SSE4.1 | AVX | XOP | AVX2 | ||
Processor | Kasumi | Nagisa | Ushio | Hina | Miyu | Airi | ||||
Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 420.987 | 242.887 | 174.021 | |||||
2x Xeon X5482 | Intel Penryn | 3.2 GHz | 179.118 | 109.439 | 72.111 | 72.011 | 71.582 | |||
Core i7 920 | Intel Nehalem | 3.5 GHz | 177.332 | 116.777 | 88.065 | 84.550 | 85.637 | |||
Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 175.253 | 112.748 | 85.591 | 89.339 | 81.713 | 72.034 | ||
FX-8350 | AMD Piledriver | 4.0 GHz | 217.327 | 113.313 | 77.238 | 79.227 | 77.273 | 105.003 | 57.478 | |
Core i7 4770K | Intel Haswell | 4.0 GHz | 159.708 | 92.977 | 66.082 | 60.625 | 60.541 | 46.573 | 26.202 | |
Core i7 6820HK | Intel Skylake | 3.2 GHz | 165.881 | 97.104 | 69.067 | 68.068 | 68.136 | 50.341 | 26.248 |
y-cruncher v0.6.9.9462 | x64 | |||||||
Pi - 1 billion digits | SSE3 | SSE4.1 | SSE4.1 | AVX | XOP | AVX2 | ||
Processor | Kasumi | Nagisa | Ushio | Hina | Miyu | Airi | ||
Core 2 Quad Q6600 | Intel Core | 2.4 GHz | 880.448 | |||||
2x Xeon X5482 | Intel Penryn | 3.2 GHz | 352.635 | 336.713 | 347.962 | |||
Core i7 920 | Intel Nehalem | 3.5 GHz | 436.454 | 412.397 | 424.928 | |||
Core i7 3630QM | Intel Ivy Bridge | 3.2 GHz | 413.265 | 399.341 | 407.710 | 350.849 | ||
FX-8350 | AMD Piledriver | 4.0 GHz | 377.213 | 369.352 | 374.660 | 506.823 | 281.864 | |
Core i7 4770K | Intel Haswell | 4.0 GHz | 311.231 | 295.216 | 300.480 | 231.361 | 129.71 | |
Core i7 6820HK | Intel Skylake | 3.2 GHz | 343.253 | 335.229 | 342.091 | 253.056 | 130.53 |
Note that the tuning parameters in v0.6.9 were very out-of-date. That's why the "Nagisa" and "Ushio" binaries don't always perform better on their target processors.