(Last updated: June 20, 2023)
Throughout the course of the project, there have been many times where I've wanted improvements to things that are otherwise beyond my control. Everything from faster hard drives to that one missing x86 instruction falls into this category. I also get asked about this on a regular basis. But rather than just making another FAQ entry, here's an entire article for it.
Therefore, this page is a gigantic wishlist of things that would benefit y-cruncher, large number arithmetic, and general high performance computing. Some of these are just ideas with only the potential to improve performance, while others are more thought out and would bring immediate speedups. So if you're in the industry with the potential to influence future products, here is a whole page of feature requests to grab ideas from.
Each item is rated on a scale of 1 to 10 on how useful I anticipate it would be. I also give my opinion on how difficult it might be to implement or achieve.
Feel free to reach out to me by email or twitter if you want to discuss.
Index:
- Improved I/O Bandwidth
- Improved I/O Reliability
- Improved Memory Bandwidth
- Bigger Caches
- High Endurance Flash
Since 2017, y-cruncher has been almost completely communication bound. Likewise, the biggest performance improvements will be realized by improving bandwidth.
Improved I/O Bandwidth:
Usefulness: 10
Difficulty to Implement: Medium
The latest record for the most digits of Pi was 8-to-1 bottlenecked by disk bandwidth. So it goes without saying that this is the thing to fix.
With current hardware, we need about 20 GB/s of disk bandwidth to not severely bottleneck on disk access. And at the current record front, there needs to be 200+ TB of storage - all of which must be able to sustain 20 GB/s of bandwidth. While it's easy to achieve one or the other, getting both at the same time is more difficult.
- The largest hard drives are 16 TB. So only a dozen of them will be needed to reach the capacity target. The problem is that hard drives are capped at about 200 MB/s of sustained sequential bandwidth. So you'll need 100 of them to achieve 20 GB/s.
- NVMe SSDs can easily do 3 GB/s of reads and 1+ GB/s of sustained writes. So a dozen of these will hit the 20 GB/s. While there are no motherboards that have that many M.2 slots, you can use PCIe adaptors to get there. The problem is that the largest NVMe SSDs are only 2 TB. So you'll need 100 of these to reach 200 TB.
Attempting to cache a large hard drive array using NVMe is not likely to bring a huge benefit due to the lack of data locality in the algorithms involved. Though the extent of this is unclear. Whatever the case is, non-trivial software changes will be needed to properly exploit an NVMe cache. Likewise, any use of SSDs will inevitably run into the issue of endurance.
In any case, the NVMe approach (#2) is not feasible due to the limited number of PCIe lanes on current systems. On the other hand, the hard drive approach (#1) is theoretically possible.
In order to get 100 hard drives into a single motherboard, you need to fill every PCIe slot with very large storage controllers. As of this writing, there exist 24-port SAS/SATA PCIe cards. So 4 of them basically get you to 100. But it's unclear if any of these cards can actually sustain the full bandwidth of every single hard drive running at full speed. After all, we're talking 5 GB/s per card. The PCIe slots can handle the bandwidth, so the question is about the card.
What I want to see is a daisy-chainable storage system that can utilize the full bandwidth of the PCIe slots and can fan out to hundreds of SATA devices.
Going beyond storage, I/O bandwidth can also mean network bandwidth. So 20 GB/s of network bandwidth can be translated into 20 GB/s of storage bandwidth. Network bandwidth is arguably more useful if and when the project goes the supercomputing route and utilizes multiple machines at once. In that scenario, the extreme bandwidth requirements don't go away. But instead of storage traffic, it becomes network traffic.
The 10 rating here (meaning most useful) is just for the purpose of setting digit records. The usefulness beyond this is more limited since few HPC applications do out-of-core computation to the extent that digit computations do.
Improved I/O Reliability:
Usefulness: 5
Difficulty to Implement: Very High
This used to be rated a 9 in usefulness but has since been downgraded as y-cruncher now has built-in I/O error detection.
The consequence of extremely large storage systems is reliability - or rather, the lack of it. And unfortunately, mass storage doesn't have a good track record for that.
It's unclear what needs to be done hardware-wise to make things reliable. It's one thing to have errors. It's another to not detect them. Given that unreliable mass storage has been the status quo for many years, it's unlikely it will just change in the near future.
Improved Memory Bandwidth:
Usefulness: 8
Difficulty to Implement: ???
Of all the requests on this page, this is the one that likely would benefit by far the largest audience. And it's only going to get bigger.
Skylake X brought a huge increase in computational power via AVX512 and increased core counts. But the increases in memory bandwidth were limited to marginal clock speed increases. AMD has done the same with massive increases in core count. Similarly, there have been little to no increases in memory bandwidth to complement them.
For y-cruncher, the last time there was a good balance between compute power and memory bandwidth was Haswell-E. Since then, the situation has basically spun out of control. Even without AVX512, Zen 2 is expected to be the worst to date simply due to core count and 256-bit AVX parity with Intel.
Some major improvements were made between v0.7.3 and v0.7.5 to reduce the bandwidth usage. But these have hit a limit as much of the code is now within a small constant factor of being "memory-optimal" for the current set of algorithms. So now everything is at the mercy of hardware.
At the current cache sizes, the ideal computation-to-bandwidth ratio for y-cruncher seems to be around 10 FLOPs / byte of DRAM access. This corresponds approximately to an 8-core Haswell at 4 GHz with quad-channel DDR4 @ 2133. On Skylake X, it's easily 20+ FLOPs/byte. The 16-core Zen 2 Ryzen and the anticipated 64-core Zen 2 Threadripper are expected to be at 30+ and 60+ FLOPs/byte respectively. This is bad enough that you can probably disable most of the cores without really affecting the performance.
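As a rough back-of-the-envelope check on that ratio (theoretical peak numbers of my own, so treat it as an approximation):

    Peak compute:  8 cores x 4 GHz x 16 DP FLOPs/cycle (2 x 256-bit FMA)  ~ 512 GFLOP/s
    Peak DRAM:     4 channels x 2133 MT/s x 8 bytes/transfer              ~ 68 GB/s
    Ratio:         512 GFLOP/s / 68 GB/s                                  ~ 7.5 FLOPs/byte

That lands in the same ballpark as the ~10 FLOPs/byte figure once real-world efficiency is factored in.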
What about larger caches? See the next section.
Bigger Caches:
Usefulness: 7
Difficulty to Implement: Very High
While cache sizes have steadily grown over time, it's not actually enough. Total cache has increased, but cache per core has not. The total cache/thread has been stagnant at about 1 MB/thread for well over a decade now.
In perfectly parallelized workloads, all cores (and hyperthreads) will be fully utilized and independently working on different workloads to minimize synchronization. But this means that the current caches (which are actually pretty large) get divided up evenly across the many threads.
Why does this matter? To oversimplify a bit, for y-cruncher, the difference between 1 MB/thread and 32 MB/thread is up to ~2x in memory bandwidth consumption. Thus it ties into the previous section about memory bandwidth. If memory bandwidth cannot keep up with computational improvements, we can help compensate by increasing cache sizes.
The problem is that 32 MB/thread implies over a gigabyte of cache on current desktop systems. Because cache doesn't scale due to trace capacitance, I honestly don't see this happening any time soon. Instead, we see chip makers using the die area to cram in more cores - which unfortunately is less useful for memory-bound applications.
Are there any alternative approaches? While I'm not very knowledgeable in cache design, perhaps large caches may be possible by sacrificing latency. For non-pointer-chasing workloads, latency isn't that important with proper prefetching or sufficiently large out-of-order execution capability. So maybe a high latency + high bandwidth L4 cache? Upcoming versions of y-cruncher will be able to utilize many levels of cache.
What about Intel-style MCDRAM (Xeon Phi) or HBM (Sapphire Rapids)? Those are the right idea, but perhaps too extreme. It doesn't need to be that big and the latencies are a bit high. The other problem is that HBM can only act as a direct-mapped cache for RAM and thus has no associativity. This is very bad for FFT-like workloads.
High Endurance Flash:
Usefulness: 7
Difficulty to Implement: Medium
High endurance flash isn't that useful for record chasing since the capacities are not really large enough anyway. But it is very useful for development/testing of Swap Mode and more generally any sort of out-of-core application.
Hard drives are extremely slow, and getting slower relative to computational power. So development/testing of Swap Mode is extremely time-consuming. SSDs (especially NVMe) can make things much faster. But as I note here, they don't have the endurance to last very long. So my laboratory still uses hard drives for this purpose. SSDs are only for boot drives and other forms of "normal" usage.
So what we need is extremely high endurance flash. Something that can handle 100s of petabytes of writes.
This may already exist: Optane/3D XPoint. Intel refuses to give concrete numbers for endurance. (Even when I asked them this on stage at Hot Chips 30.) But the fact that they exist in DRAM form factor has to imply that it is significantly higher than normal flash. Another problem is that the capacities aren't really comparable to modern SSDs.
Another option is modern SLC SSDs. Modern SSDs have flash cells that are reliable enough to store multiple bits per cell (MLC). The problem is that MLC hurts endurance. Nevertheless, the industry is going all-in on MLC SSDs because they offer higher capacities. Is there enough of a market left for extremely high endurance SLC SSDs?
So technologically this is probably doable, but the economics doesn't allow it.
Most of the stuff here is meant for x86/64 with AVX512. But the concepts apply to any architecture.
You will notice that the caching instructions are generally ranked higher than all the computational ones in terms of usefulness. This is mostly a reflection of how memory-bound the project has become. Computational improvements are much less useful in the face of such a memory bottleneck.
Load and Evict:
Usefulness: 6
Difficulty to Implement: ???
Load 64 aligned bytes into a ZMM register. Evict the cacheline from all levels of cache or mark it as next to evict.
This doesn't even need to be a new instruction. Just change the behavior of vmovntdqa to this for normal (non-WC) memory.
One additional requirement is that the instruction has high burst throughput. One should be able to issue 20+ of them consecutively without it stalling execution (assuming the loads are hitting the L1 cache). In other words, the resulting background evictions shouldn't prevent the instruction from retiring.
(Update 2023: Intel has introduced the cldemote instruction for Sapphire Rapids. Does (load + cldemote) have the desired effect here? TBD...)
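Here is a minimal sketch of what that (load + cldemote) pattern looks like with the intrinsics that exist today. Whether it actually achieves the desired eviction behavior is exactly the open question above:

#include <immintrin.h>
#include <cstddef>

// Stream 64-byte blocks through a ZMM register, then hint the core to demote
// each line toward the outer cache levels. (Needs a CPU + compiler with CLDEMOTE.)
void copy_and_demote(const double* src, double* dst, size_t n_doubles){
    for (size_t i = 0; i < n_doubles; i += 8){
        __m512d x = _mm512_load_pd(src + i);   // normal cached load
        _cldemote(src + i);                    // hint: push this line out of the private caches
        _mm512_store_pd(dst + i, x);
    }
}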
Motivation:
This is for streaming data you know won't be needed again for a while. So there's no point in keeping it cached. So kick it out right away so other cores don't need to steal ownership when they need it. Or mark it as next-to-evict from the current cache way.
There is a question of whether evicting it from cache is necessarily a good idea. It's possible there is another core (or even the hyperthread) that's also using the same data. Though I would assume such a scenario is rare.
Perhaps this instruction can be improved by adding an immediate specifying how many levels of cache to evict from.
Difficulty to Implement:
I don't know enough about cache design to competently judge how difficult this is to implement.
Load and Destroy:
Usefulness: 4
Difficulty to Implement: ???
Load 64 aligned bytes into a ZMM register. Then either zero the data or mark it as "unspecified".
The subsequent "destruction" of the data is weakly ordered and ordered only with sfence and any lock-prefixed instruction.
Similar to the load-and-evict instruction above, load-and-destroy also needs to have high burst throughput as they will be used in the same manner.
Motivation:
The idea here is to tell the CPU that the data isn't needed anymore. Don't flush it back to memory even if it's dirty since that wastes memory bandwidth.
The usecase here is scratch memory. Cache-aware algorithms often have their own "cache regions" where the code manually pulls data from memory into a contiguous buffer, does some computation and writes it back. When it is done using the scratch memory, it doesn't care about the contents anymore. So release the memory as efficiently as possible. In a way, this is similar to TRIM for SSDs.
This instruction can be made even more useful by allowing the "destroy" part to be predicated. This will eliminate a ton of duplicate code.
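To make the intent concrete, here is a hedged sketch of the scratch-buffer pattern. The "_mm512_load_destroy_pd" intrinsic is hypothetical - it is what the final pass over the scratch memory would use instead of a normal load:

#include <immintrin.h>
#include <cstddef>

// Final pass over a cache-resident scratch buffer. Today the dirty lines get
// written back to DRAM even though nobody will ever read them again.
void consume_scratch(const double* scratch, double* out, size_t n){
    for (size_t i = 0; i < n; i += 8){
        __m512d x = _mm512_load_pd(scratch + i);
        // __m512d x = _mm512_load_destroy_pd(scratch + i);   // hypothetical: read, then drop the line
        _mm512_store_pd(out + i, x);   // ...in reality, fold x into the output...
    }
}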
Difficulty to Implement:
I don't know enough about cache design to competently judge how difficult this is to implement. But is this what AMD's clzero does?
Range Prefetch:
Usefulness: 5
Difficulty to Implement: High
A prefetch instruction that prefetches an entire range of addresses rather than just one cache line.
Motivation:
The proposal here is intentionally underspecified since I'm not sure what the best approach is. But the idea here is to eliminate explicit prefetching from the inner loops of most cache-blocked code. Likewise it will attempt to solve the problem of prefetch pacing.
Oftentimes in cache-blocked algorithms, the current loop is running on a cache-sized block of data while prefetching the next block. Prefetching is then interspersed (littered) into the loop. This often leads to messy code and even GPR register pressure for multi-stream workloads.
Other complications are:
- If the interspersed prefetching is too slow, the data won't be ready when it's needed in the next block.
- If the interspersed prefetching is too fast, it may block execution if the memory queues are full.
- If the computation is irregular and the next block is "too different" from the current block, it's difficult to find a prefetch scheme that works well.
It would be much easier to just tell the CPU directly, "I need this block of memory ASAP." The CPU can then pull it in at whatever pace the hardware can handle at that point of time. In other words, let the cache itself adaptively control the prefetch speed independently of the instruction execution. If it finishes the prefetch before the data is needed, cool. If not, performance will gracefully degrade depending on how far the hardware managed to get.
The biggest problem here is that many workloads have multiple data streams. Issuing multiple range prefetches which are done sequentially won't work if the prefetching can't keep up. If the start of any stream hasn't been prefetched by the time it is needed, the execution is still gonna stall.
So a single range prefetch is insufficient to communicate the intended access pattern to the processor. A typical FFT-like kernel will have 8 or more read streams which are accessed in a cyclic manner. How do I tell the processor this so that it can prefetch them in the order that I access them in?
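To make the contrast concrete, here is a hedged sketch. The first version is the status quo of interspersed, manually paced prefetches. The second is what I would like to write - the "_range_prefetch" intrinsic is hypothetical and does not exist today:

#include <immintrin.h>
#include <cstddef>

// Status quo: prefetches for the next block are littered into the compute loop
// and must be manually paced (here, one cache line per iteration).
void process_block(const double* cur, const double* next, size_t n){
    for (size_t i = 0; i < n; i += 8){
        _mm_prefetch((const char*)(next + i), _MM_HINT_T0);
        __m512d x = _mm512_load_pd(cur + i);
        // ... compute on x ...
        (void)x;
    }
}

// Wishlist: issue one (hypothetical) range prefetch up front and let the
// hardware pace it against demand misses.
// void process_block(const double* cur, const double* next, size_t n){
//     _range_prefetch(next, n * sizeof(double));   // hypothetical instruction
//     for (size_t i = 0; i < n; i += 8){
//         __m512d x = _mm512_load_pd(cur + i);
//         // ... compute on x ...
//     }
// }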
Difficulty to Implement:
This requires some sort of microcode in the LSU to issue all the memory/cache requests. But other complications will arise.
Prioritization: Because a range prefetch is many prefetches in one, you can't have them block demand cache misses. So demand cache misses take priority. But this leads to another problem. What if there are so many demand cache misses that the prefetches get starved so long that the data isn't needed anymore? In other words, what you're prefetching has now been served as a demand cache miss. The intuitive answer is to cancel the prefetch when its cache line hits a demand miss. But with a range prefetch, there's an unboundedly large number of cache lines to keep track of.
Queuing: What happens if the code issues multiple range prefetches? How are those queued up? What if the queue is full?
Predicated Prefetch:
Usefulness: 4
Difficulty to Implement: Easy
A prefetch instruction that takes:
- A Pointer (Memory Operand)
- A Threshold (GPR)
- An immediate specifying the type of prefetch.
It behaves like a normal SSE prefetch instruction, but if the address evaluates to a number greater than or equal to the threshold, the prefetch is suppressed and no request is made. Additional variants can be made for different types of comparisons.
Motivation:
In loops with prefetching, the code will be prefetching some # of iterations ahead. But when you get to the end of the loop, it will prefetch data that's not needed. In bandwidth-constrained workloads, these "out-of-bounds" prefetches hurt performance by consuming memory bandwidth.
The typical work-around is to peel the end of the loop and remove the prefetches. But this results in a lot of duplicated code - especially for large loop bodies. For example, y-cruncher has many inner-loop bodies that are hundreds to thousands of instructions long. While the code duplication in source code is easily avoided through template metaprogramming tricks, it still takes up instruction and uop cache space.
An alternate approach that I often use is to use a conditional move to set the prefetch pointer to something that's already in cache. This solves the problem of code duplication, but it adds overhead of not only the conditional moves, but also the requests into the LSU.
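Here is a hedged sketch of that conditional-move workaround - out-of-range prefetch addresses get redirected to a dummy location that is always in cache:

#include <immintrin.h>
#include <cstddef>

void loop_with_guarded_prefetch(const double* data, size_t n, size_t dist){
    alignas(64) static double dummy[8] = {};   // always hot, so the bogus prefetch is harmless
    for (size_t i = 0; i < n; i += 8){
        const double* pf = (i + dist < n) ? data + i + dist : dummy;   // ideally compiles to a cmov
        _mm_prefetch((const char*)pf, _MM_HINT_T0);
        __m512d x = _mm512_load_pd(data + i);
        // ... compute on x ...
        (void)x;
    }
}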
Difficulty to Implement:
A predicated prefetch should not be difficult to implement in hardware - though it may require an additional uop for the comparison.
Alternative approaches include:
- Have the prefetch read the flags, so that a predicated prefetch becomes a cmp + prefetch pair - which could then be macro-op fused?
- Get rid of the comparison and directly use the lowest bit of a GPR. This is still useful for cache-blocked algorithms where the entire loop will be predicated as a whole rather than individual prefetches within the loop.
AVX512-PF indirectly supports predicated prefetching with the masking. (But this AVX512 variant dies with Xeon Phi.) And starting from Skylake, prefetches to null are suppressed. Unfortunately, setting to null doesn't work in software as only the first prefetch in a loop will be null while the rest will have offsets.
High 64x64-bit multiply (vpmulhq):
Usefulness: 6
Difficulty to Implement: Easy?
Same as vpmullq, but for the upper 64 bits. For each lane, multiply the 64-bit integers to form a 128-bit intermediate. Store the upper 64 bits in the destination.
Motivation:
- The 64-bit Number-Theoretic Transform (NTT) will benefit greatly from this as both the Shoup and Montgomery methods for multiply-modulus require an upper-half multiply.
- Implementations of the BBP algorithm that support 64-bit divisors need this for the Montgomery multiply-modulus.
- Bignum libraries that use full-word arithmetic may also benefit from this.
y-cruncher currently emulates vpmulhq with about 12-14 instructions depending on whether the inputs are preconditioned - which is terrible.
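For reference, here is a hedged sketch of a generic (non-preconditioned) emulation using plain AVX512-F. This is not y-cruncher's exact code, but it lands in the same ballpark of instruction counts:

#include <immintrin.h>

// Unsigned 64x64 -> high-64 multiply, emulated by splitting each operand into
// 32-bit halves. _mm512_mul_epu32 only reads the low 32 bits of each 64-bit lane.
__m512i mulhi_epu64(__m512i a, __m512i b){
    const __m512i mask32 = _mm512_set1_epi64(0xFFFFFFFF);
    __m512i a_hi = _mm512_srli_epi64(a, 32);
    __m512i b_hi = _mm512_srli_epi64(b, 32);

    __m512i lo_lo = _mm512_mul_epu32(a,    b   );   // a_lo * b_lo
    __m512i lo_hi = _mm512_mul_epu32(a,    b_hi);   // a_lo * b_hi
    __m512i hi_lo = _mm512_mul_epu32(a_hi, b   );   // a_hi * b_lo
    __m512i hi_hi = _mm512_mul_epu32(a_hi, b_hi);   // a_hi * b_hi

    // Sum the middle column without dropping carries.
    __m512i cross = _mm512_add_epi64(
        _mm512_add_epi64(_mm512_srli_epi64(lo_lo, 32),
                         _mm512_and_si512(lo_hi, mask32)),
        _mm512_and_si512(hi_lo, mask32));

    return _mm512_add_epi64(
        _mm512_add_epi64(hi_hi, _mm512_srli_epi64(cross, 32)),
        _mm512_add_epi64(_mm512_srli_epi64(lo_hi, 32),
                         _mm512_srli_epi64(hi_lo, 32)));
}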
Difficulty to Implement:
Currently, vpmullq is 3 uops - likely 3 passes through the double-precision FMA hardware with custom normalization shifts.
The same approach will probably work to produce a 4-uop vpmulhq. 4 uops is still much better than the alternative of 12+ freaking macro-ops.
More AVX512-IFMA Instructions:
Usefulness: 5
Difficulty to Implement: Easy
Currently there are only two instructions in this family, vpmadd52luq and vpmadd52huq. The mere existence of these already enables a completely new class of integer algorithms - bignum and not. But they can be made more useful with the following additions:
- Give the full 231, 213, and 132 treatment that the floating-point FMAs enjoy.
- Add multiply-only versions of the instructions that are non-destructive.
- Add a negative-multiply add. The multiply remains unsigned, but rather than adding it to a 64-bit input, you subtract from it instead.
- Add signed-multiply variants.
- For all the IFMA instructions, add a shift to the multiplier determined by an immediate operand.
#1 and #2 will save a bunch of eliminate-able instructions like reg-reg movs and XOR-zeroing. It will also reduce register bandwidth pressure by reducing the number of input operands (helpful for AMD processors). #3 and #4 will have more material savings. #4 currently requires 4-6 instructions to emulate depending on whether the inputs are preconditioned.
Motivation:
As powerful as they are, the current vpmadd52luq and vpmadd52huq instructions seem overly specific for one usecase - the basecase multiply for RSA encryption. This particular usecase is very matrix-multiply-like and is nothing but multiply-adds that overwrite the accumulator.
In reality, the 52-bit multiply is useful for a lot of other things as well. But most attempts to use the current IFMA instructions for anything other than basecase multiply require working around the limitations of only having a destructive multiply-add.
- vpmadd52luq is a faster alternative to vpmullq when you know the inputs are non-negative and less than 2^52. Currently this requires zeroing the accumulator which is a reg-reg move or XOR-zero. (See the sketch below.)
- Various operations in partial-word bignum implementations don't want to overwrite the accumulator. The result is tons of reg-reg moves.
- Long division involves subtractions. This would greatly benefit from a negative-multiply-add or a signed-multiply-add. In the cases where the sign of the division quotient is known statically, the overhead is "just" a bunch of reg-reg moves and subtractions. But when the signs are not known, there is no good solution.
- When using bases less than 2^52, there needs to be shifting. A very good explanation can be found here.
In particular, the signed-multiply will open up a new dimension to partial-word bignums by allowing efficient handling of balanced representations (representations where the words or "limbs" can be negative). In fact, a number of my research experiments have failed due to the need for balanced representations and the lack of an efficient way to multiply them.
For now, there's no obvious need for any of the multiply-subtract variants. Sign propagation generally eliminates the need for these. The same applies to like half the floating-point FMA instructions. But they still exist anyway...
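As a concrete illustration of the first bullet above, here is a hedged sketch of what a "multiply-only" currently costs with the existing intrinsics - the zeroed accumulator is pure overhead:

#include <immintrin.h>

// Plain 52-bit low multiply today: burn a zeroed register as the accumulator.
__m512i mul52lo(__m512i a, __m512i b){
    return _mm512_madd52lo_epu64(_mm512_setzero_si512(), a, b);   // 0 + lo52(a * b)
}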
Difficulty to Implement:
#1, #2, and #3 should be trivial to implement. Not sure about #4, but I don't anticipate that to be difficult either.
The shifting (#5) can be absorbed into the shift-normalization logic that the floating-point FMAs already have.
Here's a possible specification for the proposed new instructions:
Multiply Only (non-destructive):

  Low:
    vpmul52luq zmm0, zmm1, zmm2, imm         zmm0 = umullo52(zmm1, zmm2) * 2^imm
  Unsigned High:
    vpmul52huq zmm0, zmm1, zmm2, imm         zmm0 = umulhi52(zmm1, zmm2) * 2^imm
  Signed High:
    vpmul52hq zmm0, zmm1, zmm2, imm          zmm0 = smulhi52(zmm1, zmm2) * 2^imm

Positive Multiply Add:

  Low:
    vpmadd52l132uq zmm0, zmm1, zmm2, imm     zmm0 = zmm1 + umullo52(zmm0, zmm2) * 2^imm
    vpmadd52l213uq zmm0, zmm1, zmm2, imm     zmm0 = zmm2 + umullo52(zmm0, zmm1) * 2^imm
    vpmadd52l231uq zmm0, zmm1, zmm2, imm     zmm0 = zmm0 + umullo52(zmm1, zmm2) * 2^imm
  Unsigned High:
    vpmadd52h132uq zmm0, zmm1, zmm2, imm     zmm0 = zmm1 + umulhi52(zmm0, zmm2) * 2^imm
    vpmadd52h213uq zmm0, zmm1, zmm2, imm     zmm0 = zmm2 + umulhi52(zmm0, zmm1) * 2^imm
    vpmadd52h231uq zmm0, zmm1, zmm2, imm     zmm0 = zmm0 + umulhi52(zmm1, zmm2) * 2^imm
  Signed High:
    vpmadd52h132q zmm0, zmm1, zmm2, imm      zmm0 = zmm1 + smulhi52(zmm0, zmm2) * 2^imm
    vpmadd52h213q zmm0, zmm1, zmm2, imm      zmm0 = zmm2 + smulhi52(zmm0, zmm1) * 2^imm
    vpmadd52h231q zmm0, zmm1, zmm2, imm      zmm0 = zmm0 + smulhi52(zmm1, zmm2) * 2^imm

Negative Multiply Add:

  Low:
    vpnmadd52l132uq zmm0, zmm1, zmm2, imm    zmm0 = zmm1 - umullo52(zmm0, zmm2) * 2^imm
    vpnmadd52l213uq zmm0, zmm1, zmm2, imm    zmm0 = zmm2 - umullo52(zmm0, zmm1) * 2^imm
    vpnmadd52l231uq zmm0, zmm1, zmm2, imm    zmm0 = zmm0 - umullo52(zmm1, zmm2) * 2^imm
  Unsigned High:
    vpnmadd52h132uq zmm0, zmm1, zmm2, imm    zmm0 = zmm1 - umulhi52(zmm0, zmm2) * 2^imm
    vpnmadd52h213uq zmm0, zmm1, zmm2, imm    zmm0 = zmm2 - umulhi52(zmm0, zmm1) * 2^imm
    vpnmadd52h231uq zmm0, zmm1, zmm2, imm    zmm0 = zmm0 - umulhi52(zmm1, zmm2) * 2^imm
  Signed High:
    vpnmadd52h132q zmm0, zmm1, zmm2, imm     zmm0 = zmm1 - smulhi52(zmm0, zmm2) * 2^imm
    vpnmadd52h213q zmm0, zmm1, zmm2, imm     zmm0 = zmm2 - smulhi52(zmm0, zmm1) * 2^imm
    vpnmadd52h231q zmm0, zmm1, zmm2, imm     zmm0 = zmm0 - smulhi52(zmm1, zmm2) * 2^imm
Notes:
- umullo52(A, B): Take the bottom 52 bits of A and B as unsigned 52-bit integers. Multiply together to form a 104-bit product. Return lower 52 bits.
- umulhi52(A, B): Take the bottom 52 bits of A and B as unsigned 52-bit integers. Multiply together to form a 104-bit product. Return upper 52 bits.
- smulhi52(A, B): Take the bottom 52 bits of A and B as signed 52-bit integers (two's complement). Multiply together to form a product as a 104-bit two's complement integer. Arithmetic right shift this 104-bit integer by 52 bits. Return the lower 64 bits.
- It's unspecified what should happen if imm is greater than 12. In reality, we don't care since it won't happen intentionally. So whatever is most reasonable or easiest to implement in hardware will work.
- Also unspecified is whether to treat imm as a signed integer and allow negative shifts (or a right shift). I don't see a usecase for this so it's probably not necessary.
Faster AVX512 kshift and kadd:
Usefulness: 3
Difficulty to Implement: Potentially Difficult
All mask instructions which have cross-bit dependencies have 4 cycle latency on Skylake X. This is really slow. Can this be brought down?
(Update 2022: Centaur CNS and AMD Zen4 have both implemented AVX512 with 1-cycle latency for mask instructions. But more investigation is needed to see if these are true 1-cycle latency or if the overhead is being moved elsewhere such as domain switching delays.)
Motivation:
y-cruncher's benefit from this is marginal since it's not the performance bottleneck. But other bignum libraries may get much more.
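For a sense of where this matters, here is a hedged sketch of one known way to do vectorized carry propagation for a bignum add (requires AVX512-DQ for the 8-bit mask ops). The kshift/kadd/kxor chain sits squarely on the latency-critical path:

#include <immintrin.h>

// Add two vectors of 8 x 64-bit limbs and ripple the carries across lanes.
// The carry-out of the top lane is ignored in this sketch.
__m512i add_with_carries(__m512i a, __m512i b){
    const __m512i all_ones = _mm512_set1_epi64(-1);
    __m512i  sum   = _mm512_add_epi64(a, b);
    __mmask8 carry = _mm512_cmplt_epu64_mask(sum, a);         // lane produced a carry-out
    __mmask8 satur = _mm512_cmpeq_epi64_mask(sum, all_ones);  // lane would propagate a carry-in

    // Lanes that receive a carry: ((satur + (carry << 1)) ^ satur)
    __mmask8 carry_in = _kxor_mask8(
        _kadd_mask8(satur, _kshiftli_mask8(carry, 1)), satur);

    return _mm512_mask_sub_epi64(sum, carry_in, sum, all_ones);   // +1 where a carry arrived
}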
Difficulty to Implement:
This may actually be more difficult than it looks due to the design of the k registers. These mask registers are fast when accessed with 32-bit or 64-bit granular AVX512 instructions. This implies that each individual bit of the mask is reasonably close to its respective lane in the SIMD unit. But lane-crossing mask instructions (like kshift and kadd) have cross-lane dependencies. Given the size of the SIMD execution units and the distances between them, the mask bits would need to travel quite far on the die to "meet".
Adjacent Lane Permutes (ALPs):
Usefulness: 2
Difficulty to Implement: Easy
Let's call Adjacent Lane Permutes "ALPs" for short. This topic is big enough to write an entire blog post about. But until that happens, a short summary will have to do.
Given a lane size (32 bits, 64 bits, 128 bits, etc...), allow arbitrary permutations within adjacent lanes from two different input vectors. There are 4 types of adjacent lane permutes that are needed.
Input A: [A0, A1][A2, A3][...]
Input B: [B0, B1][B2, B3][...]
- Low: Grab the lower element in each pair of adjacent lanes from both inputs. Returns [A0, B0][A2, B2][...]
- High: Grab the upper element in each pair of adjacent lanes from both inputs. Returns [A1, B1][A3, B3][...]
- Blend: For each pair of lanes, grab the lower element from input A and the upper element from input B. Returns [A0, B1][A2, B3][...]
- Cross: For each pair of lanes, grab the upper element from input A and the lower element from input B. Returns [A1, B0][A3, B2][...]
Each lane size will need all 4 of these. On AVX512, there are 6 lane sizes going down to byte granularity: 256, 128, 64, 32, 16, and 8.
(Though I've personally never had a need to go below 32-bit granularity.)
vpunpcklqdq and vpunpckhqdq are examples of 64-bit ALPs. (low/high variants)
vperm2f128 and vperm2i128 are examples of 128-bit ALPs. (all 4 variants, but no 512-bit version exists)
vshufpd is a 64-bit ALP. (all 4 variants)
vpunpckldq and vpunpckhdq are not ALPs because they don't pull from adjacent 32-bit lanes.
vshuff64x2 and vshufi64x2 can do 256-bit ALPs (all 4 variants), but they can't do any of the 128-bit ALPs because they can't pull from adjacent lanes.
vinsertf64x4 and vinserti64x4 are both 256-bit ALPs. (low and blend variants)
In short, some of the ALPs already exist. But many are missing. The missing ones currently need to be emulated.
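For example, the 32-bit "Low" ALP has no dedicated instruction, but it can be emulated in one uop with a 2-input permute - at the cost of carrying around a permute-index constant, which is exactly the kind of overhead listed below. A hedged sketch:

#include <immintrin.h>

// 32-bit "Low" ALP: [A0, B0][A2, B2][...]. Indices 16-31 select from b.
__m512i alp32_low(__m512i a, __m512i b){
    const __m512i idx = _mm512_setr_epi32( 0, 16,  2, 18,  4, 20,  6, 22,
                                           8, 24, 10, 26, 12, 28, 14, 30);
    return _mm512_permutex2var_epi32(a, idx, b);   // single vpermt2d, but needs the constant
}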
Motivation:
Adjacent lane permutes are a set of building blocks for efficiently performing arbitrary dimension SIMD transposes such as 8x5, 16x13, etc... These are critical for efficiently performing AOS <-> SOA conversions which are subsequently critical for vectorizing a lot of code that is otherwise not vectorizable. Thus, ALPs allow odd-sized transposes to be done significantly faster than both gather/scatter and scalar code.
On Skylake X with AVX512-F, all ALPs of 32-bit granularity or larger can be done with a single 1-uop instruction. 16-bit and 8-bit granularities will require up to 2 or 3 uops respectively. Cannon Lake with AVX512-VBMI brings the 16-bit and 8-bit granularities down to 1 and 2 uops respectively.
But in all cases, there is a substantial amount of overhead that includes:
- Permute vectors.
- Mask constants.
- Reg-reg moves due to the destructiveness of 2-input permutes and embedded blend-merging.
- Type-casts due to some permutes existing in one type, but not the others. (Mostly a usage inconvenience in C/C++ with intrinsics.)
Mask register pressure also becomes a problem for smaller granularity transposes.
Difficulty to Implement:
The nature of the ALPs is that data locality increases with smaller granularities. So it is not necessary to have the expensive all-to-all permute hardware. It is possible to implement all of the ALPs with only O(N log(N)) transistors and routing traffic (where N is the bitlength of the SIMD vector).
Since AVX512 already has single-uop all-to-all permutes, it should be trivial to implement all the missing ALPs with single-uop instructions. The smaller granular ALPs aren't efficient this way, but since they have high locality, they are likely doable with minimal extra hardware as single-cycle latency instructions.
Overall, I rate the "usefulness" of native ALPs as only a 2 because:
- Data shuffling is generally not a performance bottleneck because it's usually possible to optimize it out of most critical compute kernels.
- All of the ALPs are either already directly supported or can be emulated in AVX512 with reasonable efficiency.
But since they are theoretically easy to implement in hardware, it's worth asking for them in some future processor.
Not everything needs to be a hardware improvement. Software improvements are possible too (and easier to implement).
Intrinsic for Complex Addressing:
Usefulness: 5
Difficulty to Implement: Very Easy
A common pattern in FFT-like workloads is strided memory access. For example, a radix-8 butterfly will have 8 or more memory streams:
- 8 streams if in-place.
- 16 streams if out-of-place.
- 24 streams if out-of-place + prefetching in the future.
Regardless of whether these streams are represented as pointers or offset indices, they all run into the same problem - register pressure. x64 simply does not have enough general purpose registers to adequately suit these types of access patterns.
There is one trick that solves this - a method I call "pointer folding". Say you want to access 8 streams. Thus you want to create the effect of:
__m512d* T0 = T + 0*stride;
__m512d* T1 = T + 1*stride;
__m512d* T2 = T + 2*stride;
__m512d* T3 = T + 3*stride;
__m512d* T4 = T + 4*stride;
__m512d* T5 = T + 5*stride;
__m512d* T6 = T + 6*stride;
__m512d* T7 = T + 7*stride;
for (...){
__m512d r0 = T0[0];
__m512d r1 = T1[0];
__m512d r2 = T2[0];
__m512d r3 = T3[0];
__m512d r4 = T4[0];
__m512d r5 = T5[0];
__m512d r6 = T6[0];
__m512d r7 = T7[0];
// Do work
T0 += 1;
T1 += 1;
T2 += 1;
T3 += 1;
T4 += 1;
T5 += 1;
T6 += 1;
T7 += 1;
}
But if you don't have 8 available registers to hold the pointers, you can fold them into 4 registers as follows:
char* T0 = (char*)(T + 0*stride);
char* T7 = (char*)(T + 7*stride);
size_t p = stride * sizeof(__m512d);
ptrdiff_t n = -(ptrdiff_t)p;   // negative byte stride
for (...){
__m512d r0 = _mm512_load_pd(T0);
__m512d r1 = _mm512_load_pd(T0 + 1*p);
__m512d r2 = _mm512_load_pd(T0 + 2*p);
__m512d r3 = _mm512_load_pd(T7 + 4*n);
__m512d r4 = _mm512_load_pd(T0 + 4*p);
__m512d r5 = _mm512_load_pd(T7 + 2*n);
__m512d r6 = _mm512_load_pd(T7 + 1*n);
__m512d r7 = _mm512_load_pd(T7);
// Do work
T0 += 64;
T7 += 64;
}
Subsequently, the start of the loop would ideally compile to something like this:
vmovapd zmm0, zmmword ptr [T0]
vmovapd zmm1, zmmword ptr [T0 + 1*p]
vmovapd zmm2, zmmword ptr [T0 + 2*p]
vmovapd zmm3, zmmword ptr [T7 + 4*n]
vmovapd zmm4, zmmword ptr [T0 + 4*p]
vmovapd zmm5, zmmword ptr [T7 + 2*n]
vmovapd zmm6, zmmword ptr [T7 + 1*n]
vmovapd zmm7, zmmword ptr [T7]

Thus it is now possible to do 16 or even 24 streams with only 16 general purpose registers. Great! Furthermore, this approach can be trivially extended to up to 12 streams of equal stride.
Well not so fast. The problem is that compilers won't actually let this happen. When a compiler sees a common subexpression such as (4*n), it will try to optimize it out. This leads to more "live values" and thus more registers which leads to crazy amounts of register spilling.
The only compiler I've seen that does a decent job of preserving the intent of the above code is the Intel Compiler Classic (ICC). But that compiler is deprecated and its replacement, Intel LLVM Compiler (ICX), is drastically inferior in this aspect (among other things).
Proposed Solution:
Add an intrinsic for SIB (scale-index-base) complex addressing. So maybe one day we can write it like this:
char* T0 = (char*)(T + 0*stride);
char* T7 = (char*)(T + 7*stride);
size_t p = stride * sizeof(__m512d);
ptrdiff_t n = -(ptrdiff_t)p;   // negative byte stride
for (...){
__m512d r0 = _mm512_load_pd(T0);
__m512d r1 = _mm512_load_pd(_sib(T0, p, 1, 0)); // T0 + 1*p + 0
__m512d r2 = _mm512_load_pd(_sib(T0, p, 2, 0)); // T0 + 2*p + 0
__m512d r3 = _mm512_load_pd(_sib(T7, n, 4, 0));
__m512d r4 = _mm512_load_pd(_sib(T0, p, 4, 0));
__m512d r5 = _mm512_load_pd(_sib(T7, n, 2, 0));
__m512d r6 = _mm512_load_pd(_sib(T7, n, 1, 0));
__m512d r7 = _mm512_load_pd(T7);
// Do work
T0 += 64;
T7 += 64;
}
Funny enough, the gather/scatter SIMD intrinsics already support this. We just need to extend the functionality to regular load/stores.
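For reference, a gather already expresses base + index*scale addressing - something like:

#include <immintrin.h>

// The gather intrinsics already take a base pointer, an index vector, and a scale.
__m512d gather_example(const double* base, __m512i vindex){
    return _mm512_i64gather_pd(vindex, base, 8);   // per lane: load from base + vindex[i]*8
}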
Nice Things that have Happened:
Sometimes, we do get nice things - but only sometimes...
Here are all the things I've wanted in the past that eventually came to life! This list omits the things that I only realized were useful after they had already been announced.
64-bit Arithmetic Right-Shift:
Usefulness: 3
Launched: XOP (2011), AVX512 (2016)
Instruction(s): vpshaq, vpsraq
The 64-bit arithmetic shift was one of the glaring omissions of the entire SSE line. AMD's XOP instruction set finally added it in 2011. But XOP is a dead-end instruction set. So this wasn't realistically added until AVX512. And we had to wait until Skylake X in 2017 to get it for real since Xeon Phi never went mainstream.
In most cases, this could be emulated with 3 instructions.
64-bit Unsigned Integer Compare:
Usefulness: 3
Launched: XOP (2011), AVX512 (2016)
Instruction(s): vpcomuq, vpcmpuq
Same as the 64-bit arithmetic shift in every single way. Also could be emulated with 3 instructions.
64-bit Integer <-> Double Conversion:
Usefulness: 2
Launched: AVX512-DQ (2017)
Instruction(s): vcvtqq2pd, vcvtuqq2pd, vcvtpd2uqq, vcvtpd2qq, vcvttpd2qq, vcvttpd2uqq
Yet another glaring omission from the SSE line. Fortunately most of these could be emulated with 2 instructions for the common use cases.
Variable Shifts:
Usefulness: 2
Launched: XOP (2011), AVX2 (2013)
Instruction(s): vpsh*, vpsllv*, vpsrlv*, vpsrav*
Another annoying omission from the SSE line. Fortunately, y-cruncher only ever needed the 64-bit versions which are the cheapest to emulate.