(Last updated: June 20, 2023)
Throughout the course of the project, there have been many times where I've wanted improvements to things that are otherwise beyond my control. Everything from faster hard drives to that one missing x86 instruction falls into this category. I also get asked about this on a regular basis. But rather than just making another FAQ entry, here's an entire article for it.
Therefore, this page is a gigantic wishlist of things that would benefit y-cruncher, large number arithmetic, and general high performance computing. Some of these are just ideas with only the potential to improve performance, while others are more thought out and would bring immediate speedups. So if you're in the industry with the potential to influence future products, here is a whole page of feature requests to grab ideas from.
Each item is rated on a scale of 1 to 10 on how useful I anticipate it would be. I also give my opinion on how difficult it might be to implement or achieve.
Feel free to reach out to me by email or twitter if you want to discuss.
Index:
- Improved I/O Bandwidth
- Improved I/O Reliability
- Improved Memory Bandwidth
- Bigger Caches
- High Endurance Flash
Since 2017, y-cruncher has been almost completely communication bound. Likewise, the biggest performance improvements will be realized by improving bandwidth.
Improved I/O Bandwidth:
Usefulness: 10
Difficulty to Implement: Medium
The latest record for the most digits of Pi was 8-to-1 bottlenecked by disk bandwidth. So it goes without saying that this is the thing to fix.
With current hardware, we need about 20 GB/s of disk bandwidth to not severely bottleneck on disk access. And at the current record front, there needs to be 200+ TB of storage - all of which must be able to sustain 20 GB/s of bandwidth. While it's easy to achieve one or the other, getting both at the same time is more difficult.
- The largest hard drives are 16 TB. So only a dozen of them will be needed to reach the capacity target. The problem is that hard drives are capped at about 200 MB/s of sustained sequential bandwidth. So you'll need 100 of them to achieve 20 GB/s.
- NVMe SSDs can easily do 3 GB/s of reads and 1+ GB/s of sustained writes. So a dozen of these will hit the 20 GB/s. While there are no motherboards that have that many M.2 slots, you can use PCIe adaptors to get there. The problem is that the largest NVMe SSDs are only 2 TB. So you'll need 100 of these to reach 200 TB.
Attempting to cache a large hard drive array using NVMe is not likely to bring a huge benefit due to the lack of data locality in the algorithms involved. Though the extent of this is unclear. Whatever the case is, non-trivial software changes will be needed to properly exploit an NVMe cache. Likewise, any use of SSDs will inevitably run into the issue of endurance.
In any case, the NVMe approach (#2) is not feasible due to the limited number of PCIe lanes on current systems. On the other hand, the hard drive approach (#1) is theoretically possible.
In order to get 100 hard drives into a single motherboard, you need to fill every PCIe slot with very large storage controllers. As of this writing, there exist 24-port SAS/SATA PCIe cards. So 4 of them basically get you to 100. But it's unclear if any of these cards can actually sustain the full bandwidth of every single hard drive running at full speed. After all, we're talking 5 GB/s per card. The PCIe slots can handle the bandwidth, so the question is about the card.
What I want to see is a daisy-chainable storage system that can utilize the full bandwidth of the PCIe slots and can fan out to hundreds of SATA devices.
Going beyond storage, I/O bandwidth can also mean network bandwidth. So 20 GB/s of network bandwidth can be translated into 20 GB/s of storage bandwidth. Network bandwidth is arguably more useful if and when the project goes the supercomputing route and utilizes multiple machines at once. In that scenario, the extreme bandwidth requirements don't go away. But instead of storage traffic, it becomes network traffic.
The 10 rating here (meaning most useful) is just for the purpose of setting digit records. The usefulness beyond this is more limited since few HPC applications do out-of-core computation to the extent that digit computations do.
Improved I/O Reliability:
Usefulness: 5
Difficulty to Implement: Very High
This used to be rated a 9 in usefulness but has since been downgraded as y-cruncher now has built-in I/O error detection.
The consequence of extremely large storage systems is reliability - or rather, the lack of it. And unfortunately, mass storage doesn't have a good track record for that.
It's unclear what needs to be done hardware-wise to make things reliable. It's one thing to have errors. It's another to not detect them. Given that unreliable mass storage has been the status quo for many years, it's unlikely it will just change in the near future.
Improved Memory Bandwidth:
Usefulness: 8
Difficulty to Implement: ???
Of all the requests on this page, this is the one that likely would benefit by far the largest audience. And it's only going to get bigger.
Skylake X brought a huge increase in computational power via AVX512 and increased core counts. But the increases in memory bandwidth were limited to marginal clock speed increases. AMD has done the same with massive increases in core count. Similarly, there have been little to no increases in memory bandwidth to complement them.
For y-cruncher, the last time there was a good balance between compute power and memory bandwidth was Haswell-E. Since then, the situation has basically spun out of control. Even without AVX512, Zen 2 is expected to be the worst to date simply due to core count and 256-bit AVX parity with Intel.
Some major improvements were made between v0.7.3 and v0.7.5 to reduce the bandwidth usage. But these have hit a limit as much of the code is now within a small constant factor of being "memory-optimal" for the current set of algorithms. So now everything is at the mercy of hardware.
At the current cache sizes, the ideal computation-to-bandwidth ratio for y-cruncher seems to be around 10 FLOPs / byte of DRAM access. This corresponds approximately to an 8-core Haswell at 4 GHz with quad-channel DDR4 @ 2133. On Skylake X, it's easily 20+ FLOPs/byte. The 16-core Zen 2 Ryzen and the anticipated 64-core Zen 2 Threadripper are expected to be at 30+ and 60+ FLOPs/byte respectively. This is bad enough that you can probably disable most of the cores without really affecting the performance.
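As a rough back-of-the-envelope check on that ratio (theoretical peak numbers of my own, so treat it as an approximation):

    Peak compute:  8 cores x 4 GHz x 16 DP FLOPs/cycle (2 x 256-bit FMA)  ~ 512 GFLOP/s
    Peak DRAM:     4 channels x 2133 MT/s x 8 bytes/transfer              ~ 68 GB/s
    Ratio:         512 GFLOP/s / 68 GB/s                                  ~ 7.5 FLOPs/byte

That lands in the same ballpark as the ~10 FLOPs/byte figure once real-world efficiency is factored in.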
What about larger caches? See the next section.
Bigger Caches:
Usefulness: 7
Difficulty to Implement: Very High
While cache sizes have steadily grown over time, it's not actually enough. Total cache has increased, but cache per core has not. The total cache/thread has been stagnant at about 1 MB/thread for well over a decade now.
In perfectly parallelized workloads, all cores (and hyperthreads) will be fully utilized and independently working on different workloads to minimize synchronization. But this means that the current caches (which are actually pretty large) get divided up evenly across the many threads.
Why does this matter? To oversimplify a bit, for y-cruncher, the difference between 1 MB/thread and 32 MB/thread is up to ~2x in memory bandwidth consumption. Thus it ties into the previous section about memory bandwidth. If memory bandwidth cannot keep up with computational improvements, we can help compensate by increasing cache sizes.
The problem is that 32 MB/thread implies over a gigabyte of cache on current desktop systems. Because cache doesn't scale due to trace capacitance, I honestly don't see this happening any time soon. Instead, we see chip makers using the die area to cram in more cores - which unfortunately is less useful for memory-bound applications.
Are there any alternative approaches? While I'm not very knowledgeable in cache design, perhaps large caches may be possible by sacrificing latency. For non-pointer-chasing workloads, latency isn't that important with proper prefetching or sufficiently large out-of-order execution capability. So maybe a high latency + high bandwidth L4 cache? Upcoming versions of y-cruncher will be able to utilize many levels of cache.
What about Intel-style MCDRAM (Xeon Phi) or HBM (Sapphire Rapids)? Those are the right idea, but perhaps too extreme. It doesn't need to be that big and the latencies are a bit high. The other problem is that HBM can only act as a direct-mapped cache for RAM and thus has no associativity. This is very bad for FFT-like workloads.
High Endurance Flash:
Usefulness: 7
Difficulty to Implement: Medium
High endurance flash isn't that useful for record chasing since the capacities are not really large enough anyway. But it is very useful for development/testing of Swap Mode and more generally any sort of out-of-core application.
Hard drives are extremely slow, and getting slower relative to computational power. So development/testing of Swap Mode is extremely time-consuming. SSDs (especially NVMe) can make things much faster. But as I note here, they don't have the endurance to last very long. So my laboratory still uses hard drives for this purpose. SSDs are only for boot drives and other forms of "normal" usage.
So what we need is extremely high endurance flash. Something that can handle 100s of petabytes of writes.
This may already exist: Optane/3D XPoint. Intel refuses to give concrete numbers for endurance. (Even when I asked them this on stage at Hot Chips 30.) But the fact that they exist in DRAM form factor has to imply that it is significantly higher than normal flash. Another problem is that the capacities aren't really comparable to modern SSDs.
Another option is modern SLC SSDs. Modern SSDs have flash cells that are reliable enough to store multiple bits per cell (MLC). The problem is that MLC hurts endurance. Nevertheless, the industry is going all-in on MLC SSDs because they offer higher capacities. Is there enough of a market left for extremely high endurance SLC SSDs?
So technologically this is probably doable, but the economics doesn't allow it.
Most of the stuff here is meant for x86/64 with AVX512. But the concepts apply to any architecture.
You will notice that the caching instructions are generally ranked higher than all the computational ones in terms of usefulness. This is mostly a reflection of how memory-bound the project has become. Computational improvements are much less useful in the face of such a memory bottleneck.
Load and Evict:
Usefulness: 6
Difficulty to Implement: ???
Load 64 aligned bytes into a ZMM register. Evict the cacheline from all levels of cache or mark it as next to evict.
This doesn't even need to be a new instruction. Just change the behavior of vmovntdqa to this for normal (non-WC) memory.
One additional requirement is that the instruction has high burst throughput. One should be able to issue 20+ of them consecutively without it stalling execution (assuming the loads are hitting the L1 cache). In other words, the resulting background evictions shouldn't prevent the instruction from retiring.
(Update 2023: Intel has introduced the cldemote instruction for Sapphire Rapids. Does (load + cldemote) have the desired effect here? TBD...)
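Here is a minimal sketch of what that (load + cldemote) pattern looks like with the intrinsics that exist today. Whether it actually achieves the desired eviction behavior is exactly the open question above:

#include <immintrin.h>
#include <cstddef>

// Stream 64-byte blocks through a ZMM register, then hint the core to demote
// each line toward the outer cache levels. (Needs a CPU + compiler with CLDEMOTE.)
void copy_and_demote(const double* src, double* dst, size_t n_doubles){
    for (size_t i = 0; i < n_doubles; i += 8){
        __m512d x = _mm512_load_pd(src + i);   // normal cached load
        _cldemote(src + i);                    // hint: push this line out of the private caches
        _mm512_store_pd(dst + i, x);
    }
}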
Motivation:
This is for streaming data you know won't be needed again for a while. So there's no point in keeping it cached. So kick it out right away so other cores don't need to steal ownership when they need it. Or mark it as next-to-evict from the current cache way.
There is a question of whether evicting it from cache is necessarily a good idea. It's possible there is another core (or even the hyperthread) that's also using the same data. Though I would assume such a scenario is rare.
Perhaps this instruction can be improved by adding an immediate specifying how many levels of cache to evict from.
Difficulty to Implement:
I don't know enough about cache design to competently judge how difficult this is to implement.
Load and Destroy:
Usefulness: 4
Difficulty to Implement: ???
Load 64 aligned bytes into a ZMM register. Then either zero the data or mark it as "unspecified".
The subsequent "destruction" of the data is weakly ordered and ordered only with sfence and any lock-prefixed instruction.
Similar to the load-and-evict instruction above, load-and-destroy also needs to have high burst throughput as they will be used in the same manner.
Motivation:
The idea here is to tell the CPU that the data isn't needed anymore. Don't flush it back to memory even if it's dirty since that wastes memory bandwidth.
The usecase here is scratch memory. Cache-aware algorithms often have their own "cache regions" where the code manually pulls data from memory into a contiguous buffer, does some computation and writes it back. When it is done using the scratch memory, it doesn't care about the contents anymore. So release the memory as efficiently as possible. In a way, this is similar to TRIM for SSDs.
This instruction can be made even more useful by allowing the "destroy" part to be predicated. This will eliminate a ton of duplicate code.
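To make the intent concrete, here is a hedged sketch of the scratch-buffer pattern. The "_mm512_load_destroy_pd" intrinsic is hypothetical - it is what the final pass over the scratch memory would use instead of a normal load:

#include <immintrin.h>
#include <cstddef>

// Final pass over a cache-resident scratch buffer. Today the dirty lines get
// written back to DRAM even though nobody will ever read them again.
void consume_scratch(const double* scratch, double* out, size_t n){
    for (size_t i = 0; i < n; i += 8){
        __m512d x = _mm512_load_pd(scratch + i);
        // __m512d x = _mm512_load_destroy_pd(scratch + i);   // hypothetical: read, then drop the line
        _mm512_store_pd(out + i, x);   // ...in reality, fold x into the output...
    }
}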
Difficulty to Implement:
I don't know enough about cache design to competently judge how difficult this is to implement. But is this what AMD's clzero does?
Range Prefetch:
Usefulness: 5
Difficulty to Implement: High
A prefetch instruction that prefetches an entire range of addresses rather than just one cache line.
Motivation:
The proposal here is intentionally underspecified since I'm not sure what the best approach is. But the idea here is to eliminate explicit prefetching from the inner loops of most cache-blocked code. Likewise it will attempt to solve the problem of prefetch pacing.
Oftentimes in cache-blocked algorithms, the current loop is running on a cache-sized block of data while prefetching the next block. Prefetching is then interspersed (littered) into the loop. This often leads to messy code and even GPR register pressure for multi-stream workloads.
Other complications are:
- If the interspersed prefetching is too slow, the data won't be ready when it's needed in the next block.
- If the interspersed prefetching is too fast, it may block execution if the memory queues are full.
- If the computation is irregular and the next block is "too different" from the current block, it's difficult to find a prefetch scheme that works well.
It would be much easier to just tell the CPU directly, "I need this block of memory ASAP." The CPU can then pull it in at whatever pace the hardware can handle at that point of time. In other words, let the cache itself adaptively control the prefetch speed independently of the instruction execution. If it finishes the prefetch before the data is needed, cool. If not, performance will gracefully degrade depending on how far the hardware managed to get.
The biggest problem here is that many workloads have multiple data streams. Issuing multiple range prefetches which are done sequentially won't work if the prefetching can't keep up. If the start of any stream hasn't been prefetched by the time it is needed, the execution is still gonna stall.
So a single range prefetch is insufficient to communicate the intended access pattern to the processor. A typical FFT-like kernel will have 8 or more read streams which are accessed in a cyclic manner. How do I tell the processor this so that it can prefetch them in the order that I access them in?
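To make the contrast concrete, here is a hedged sketch. The first version is the status quo of interspersed, manually paced prefetches. The second is what I would like to write - the "_range_prefetch" intrinsic is hypothetical and does not exist today:

#include <immintrin.h>
#include <cstddef>

// Status quo: prefetches for the next block are littered into the compute loop
// and must be manually paced (here, one cache line per iteration).
void process_block(const double* cur, const double* next, size_t n){
    for (size_t i = 0; i < n; i += 8){
        _mm_prefetch((const char*)(next + i), _MM_HINT_T0);
        __m512d x = _mm512_load_pd(cur + i);
        // ... compute on x ...
        (void)x;
    }
}

// Wishlist: issue one (hypothetical) range prefetch up front and let the
// hardware pace it against demand misses.
// void process_block(const double* cur, const double* next, size_t n){
//     _range_prefetch(next, n * sizeof(double));   // hypothetical instruction
//     for (size_t i = 0; i < n; i += 8){
//         __m512d x = _mm512_load_pd(cur + i);
//         // ... compute on x ...
//     }
// }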
Difficulty to Implement:
This requires some sort of microcode in the LSU to issue all the memory/cache requests. But other complications will arise.
Prioritization: Because a range prefetch is many prefetches in one, you can't have them block demand cache misses. So demand cache misses take priority. But this leads to another problem. What if there are so many demand cache misses that the prefetches get starved so long that the data isn't needed anymore? In other words, what you're prefetching has now been served as a demand cache miss. The intuitive answer is to cancel the prefetch when its cache line hits a demand miss. But with a range prefetch, there's an unboundedly large number of cache lines to keep track of.
Queuing: What happens if the code issues multiple range prefetches? How are those queued up? What if the queue is full?
Predicated Prefetch:
Usefulness: 4
Difficulty to Implement: Easy
A prefetch instruction that takes:
- A Pointer (Memory Operand)
- A Threshold (GPR)
- An immediate specifying the type of prefetch.
It behaves like a normal SSE prefetch instruction, but if the address evaluates to a number greater than or equal to the threshold, the prefetch is suppressed and no request is made. Additional variants can be made for different types of comparisons.
Motivation:
In loops with prefetching, the code will be prefetching some # of iterations ahead. But when you get to the end of the loop, it will prefetch data that's not needed. In bandwidth-constrained workloads, these "out-of-bounds" prefetches hurt performance by consuming memory bandwidth.
The typical work-around is to peel the end of the loop and remove the prefetches. But this results in a lot of duplicated code - especially for large loop bodies. For example, y-cruncher has many inner-loop bodies that are hundreds to thousands of instructions long. While the code duplication in source code is easily avoided through template metaprogramming tricks, it still takes up instruction and uop cache space.
An alternate approach that I often use is to use a conditional move to set the prefetch pointer to something that's already in cache. This solves the problem of code duplication, but it adds overhead of not only the conditional moves, but also the requests into the LSU.
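Here is a hedged sketch of that conditional-move workaround - out-of-range prefetch addresses get redirected to a dummy location that is always in cache:

#include <immintrin.h>
#include <cstddef>

void loop_with_guarded_prefetch(const double* data, size_t n, size_t dist){
    alignas(64) static double dummy[8] = {};   // always hot, so the bogus prefetch is harmless
    for (size_t i = 0; i < n; i += 8){
        const double* pf = (i + dist < n) ? data + i + dist : dummy;   // ideally compiles to a cmov
        _mm_prefetch((const char*)pf, _MM_HINT_T0);
        __m512d x = _mm512_load_pd(data + i);
        // ... compute on x ...
        (void)x;
    }
}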
Difficulty to Implement:
A predicated prefetch should not be difficult to implement in hardware - though it may require an additional uop for the comparison.
Alternative approaches include:
- Have the prefetch read the flags, so that a predicated prefetch becomes a cmp + prefetch pair - which could then be macro-op fused?
- Get rid of the comparison and directly use the lowest bit of a GPR. This is still useful for cache-blocked algorithms where the entire loop will be predicated as a whole rather than individual prefetches within the loop.
AVX512-PF indirectly supports predicated prefetching with the masking. (But this AVX512 variant dies with Xeon Phi.) And starting from Skylake, prefetches to null are suppressed. Unfortunately, setting to null doesn't work in software as only the first prefetch in a loop will be null while the rest will have offsets.
High 64x64-bit multiply (vpmulhq):
Usefulness: 6
Difficulty to Implement: Easy?
Same as vpmullq, but for the upper 64 bits. For each lane, multiply the 64-bit integers to form a 128-bit intermediate. Store the upper 64 bits in the destination.
Motivation:
- The 64-bit Number-Theoretic Transform (NTT) will benefit greatly from this as both the Shoup and Montgomery methods for multiply-modulus require an upper-half multiply.
- Implementations of the BBP algorithm that support 64-bit divisors need this for the Montgomery multiply-modulus.
- Bignum libraries that use full-word arithmetic may also benefit from this.
y-cruncher currently emulates vpmulhq with about 12-14 instructions depending on whether the inputs are preconditioned - which is terrible.
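For reference, here is a hedged sketch of a generic (non-preconditioned) emulation using plain AVX512-F. This is not y-cruncher's exact code, but it lands in the same ballpark of instruction counts:

#include <immintrin.h>

// Unsigned 64x64 -> high-64 multiply, emulated by splitting each operand into
// 32-bit halves. _mm512_mul_epu32 only reads the low 32 bits of each 64-bit lane.
__m512i mulhi_epu64(__m512i a, __m512i b){
    const __m512i mask32 = _mm512_set1_epi64(0xFFFFFFFF);
    __m512i a_hi = _mm512_srli_epi64(a, 32);
    __m512i b_hi = _mm512_srli_epi64(b, 32);

    __m512i lo_lo = _mm512_mul_epu32(a,    b   );   // a_lo * b_lo
    __m512i lo_hi = _mm512_mul_epu32(a,    b_hi);   // a_lo * b_hi
    __m512i hi_lo = _mm512_mul_epu32(a_hi, b   );   // a_hi * b_lo
    __m512i hi_hi = _mm512_mul_epu32(a_hi, b_hi);   // a_hi * b_hi

    // Sum the middle column without dropping carries.
    __m512i cross = _mm512_add_epi64(
        _mm512_add_epi64(_mm512_srli_epi64(lo_lo, 32),
                         _mm512_and_si512(lo_hi, mask32)),
        _mm512_and_si512(hi_lo, mask32));

    return _mm512_add_epi64(
        _mm512_add_epi64(hi_hi, _mm512_srli_epi64(cross, 32)),
        _mm512_add_epi64(_mm512_srli_epi64(lo_hi, 32),
                         _mm512_srli_epi64(hi_lo, 32)));
}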
Difficulty to Implement:
Currently, vpmullq is 3 uops - likely 3 passes through the double-precision FMA hardware with custom normalization shifts.
The same approach will probably work to produce a 4-uop vpmulhq. 4 uops is still much better than the alternative of 12+ freaking macro-ops.
More AVX512-IFMA Instructions:
Usefulness: 5
Difficulty to Implement: Easy
Currently there are only two instructions in this family, vpmadd52luq and vpmadd52huq. The mere existence of these already enables a completely new class of integer algorithms - bignum and not. But they can be made more useful with the following additions:
- Give the full 231, 213, and 132 treatment that the floating-point FMAs enjoy.
- Add multiply-only versions of the instructions that are non-destructive.
- Add a negative-multiply add. The multiply remains unsigned, but rather than adding it to a 64-bit input, you subtract from it instead.
- Add signed-multiply variants.
- For all the IFMA instructions, add a shift to the multiplier determined by an immediate operand.
#1 and #2 will save a bunch of eliminate-able instructions like reg-reg movs and XOR-zeroing. It will also reduce register bandwidth pressure by reducing the number of input operands (helpful for AMD processors). #3 and #4 will have more material savings. #4 currently requires 4-6 instructions to emulate depending on whether the inputs are preconditioned.
Motivation:
As powerful as they are, the current vpmadd52luq and vpmadd52huq instructions seem overly specific for one usecase - the basecase multiply for RSA encryption. This particular usecase is very matrix-multiply-like and is nothing but multiply-adds that overwrite the accumulator.
In reality, the 52-bit multiply is useful for a lot of other things as well. But most attempts to use the current IFMA instructions for anything other than basecase multiply require working around the limitations of only having a destructive multiply-add.
- vpmadd52luq is a faster alternative to vpmullq when you know the inputs are non-negative and less than 2^52. Currently this requires zeroing the accumulator which is a reg-reg move or XOR-zero. (See the sketch below.)
- Various operations in partial-word bignum implementations don't want to overwrite the accumulator. The result is tons of reg-reg moves.
- Long division involves subtractions. This would greatly benefit from a negative-multiply-add or a signed-multiply-add. In the cases where the sign of the division quotient is known statically, the overhead is "just" a bunch of reg-reg moves and subtractions. But when the signs are not known, there is no good solution.
- When using bases less than 2^52, there needs to be shifting. A very good explanation can be found here.
In particular, the signed-multiply will open up a new dimension to partial-word bignums by allowing efficient handling of balanced representations (representations where the words or "limbs" can be negative). In fact, a number of my research experiments have failed due to the need for balanced representations and the lack of an efficient way to multiply them.
For now, there's no obvious need for any of the multiply-subtract variants. Sign propagation generally eliminates the need for these. The same applies to like half the floating-point FMA instructions. But they still exist anyway...
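As a concrete illustration of the first bullet above, here is a hedged sketch of what a "multiply-only" currently costs with the existing intrinsics - the zeroed accumulator is pure overhead:

#include <immintrin.h>

// Plain 52-bit low multiply today: burn a zeroed register as the accumulator.
__m512i mul52lo(__m512i a, __m512i b){
    return _mm512_madd52lo_epu64(_mm512_setzero_si512(), a, b);   // 0 + lo52(a * b)
}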
Difficulty to Implement:
#1, #2, and #3 should be trivial to implement. Not sure about #4, but I don't anticipate that to be difficult either.
The shifting (#5) can be absorbed into the shift-normalization logic that the floating-point FMAs already have.
Here's a possible specification for the proposed new instructions:
Multiply Only (non-destructive):

  Low:
    vpmul52luq zmm0, zmm1, zmm2, imm         zmm0 = umullo52(zmm1, zmm2) * 2^imm
  Unsigned High:
    vpmul52huq zmm0, zmm1, zmm2, imm         zmm0 = umulhi52(zmm1, zmm2) * 2^imm
  Signed High:
    vpmul52hq zmm0, zmm1, zmm2, imm          zmm0 = smulhi52(zmm1, zmm2) * 2^imm

Positive Multiply Add:

  Low:
    vpmadd52l132uq zmm0, zmm1, zmm2, imm     zmm0 = zmm1 + umullo52(zmm0, zmm2) * 2^imm
    vpmadd52l213uq zmm0, zmm1, zmm2, imm     zmm0 = zmm2 + umullo52(zmm0, zmm1) * 2^imm
    vpmadd52l231uq zmm0, zmm1, zmm2, imm     zmm0 = zmm0 + umullo52(zmm1, zmm2) * 2^imm
  Unsigned High:
    vpmadd52h132uq zmm0, zmm1, zmm2, imm     zmm0 = zmm1 + umulhi52(zmm0, zmm2) * 2^imm
    vpmadd52h213uq zmm0, zmm1, zmm2, imm     zmm0 = zmm2 + umulhi52(zmm0, zmm1) * 2^imm
    vpmadd52h231uq zmm0, zmm1, zmm2, imm     zmm0 = zmm0 + umulhi52(zmm1, zmm2) * 2^imm
  Signed High:
    vpmadd52h132q zmm0, zmm1, zmm2, imm      zmm0 = zmm1 + smulhi52(zmm0, zmm2) * 2^imm
    vpmadd52h213q zmm0, zmm1, zmm2, imm      zmm0 = zmm2 + smulhi52(zmm0, zmm1) * 2^imm
    vpmadd52h231q zmm0, zmm1, zmm2, imm      zmm0 = zmm0 + smulhi52(zmm1, zmm2) * 2^imm

Negative Multiply Add:

  Low:
    vpnmadd52l132uq zmm0, zmm1, zmm2, imm    zmm0 = zmm1 - umullo52(zmm0, zmm2) * 2^imm
    vpnmadd52l213uq zmm0, zmm1, zmm2, imm    zmm0 = zmm2 - umullo52(zmm0, zmm1) * 2^imm
    vpnmadd52l231uq zmm0, zmm1, zmm2, imm    zmm0 = zmm0 - umullo52(zmm1, zmm2) * 2^imm
  Unsigned High:
    vpnmadd52h132uq zmm0, zmm1, zmm2, imm    zmm0 = zmm1 - umulhi52(zmm0, zmm2) * 2^imm
    vpnmadd52h213uq zmm0, zmm1, zmm2, imm    zmm0 = zmm2 - umulhi52(zmm0, zmm1) * 2^imm
    vpnmadd52h231uq zmm0, zmm1, zmm2, imm    zmm0 = zmm0 - umulhi52(zmm1, zmm2) * 2^imm
  Signed High:
    vpnmadd52h132q zmm0, zmm1, zmm2, imm     zmm0 = zmm1 - smulhi52(zmm0, zmm2) * 2^imm
    vpnmadd52h213q zmm0, zmm1, zmm2, imm     zmm0 = zmm2 - smulhi52(zmm0, zmm1) * 2^imm
    vpnmadd52h231q zmm0, zmm1, zmm2, imm     zmm0 = zmm0 - smulhi52(zmm1, zmm2) * 2^imm
Notes:
- umullo52(A, B): Take the bottom 52 bits of A and B as unsigned 52-bit integers. Multiply together to form a 104-bit product. Return lower 52 bits.
- umulhi52(A, B): Take the bottom 52 bits of A and B as unsigned 52-bit integers. Multiply together to form a 104-bit product. Return upper 52 bits.
- smulhi52(A, B): Take the bottom 52 bits of A and B as signed 52-bit integers (two's complement). Multiply together to form a product as a 104-bit two's complement integer. Arithmetic right shift this 104-bit integer by 52 bits. Return the lower 64 bits.
- It's unspecified what should happen if imm is greater than 12. In reality, we don't care since it won't happen intentionally. So whatever is most reasonable or easiest to implement in hardware will work.
- Also unspecified is whether to treat imm as a signed integer and allow negative shifts (or a right shift). I don't see a usecase for this so it's probably not necessary.
Faster AVX512 kshift and kadd:
Usefulness: 3
Difficulty to Implement: Potentially Difficult
All mask instructions which have cross-bit dependencies have 4 cycle latency on Skylake X. This is really slow. Can this be brought down?
(Update 2022: Centaur CNS and AMD Zen4 have both implemented AVX512 with 1-cycle latency for mask instructions. But more investigation is needed to see if these are true 1-cycle latency or if the overhead is being moved elsewhere such as domain switching delays.)
Motivation:
y-cruncher's benefit from this is marginal since it's not the performance bottleneck. But other bignum libraries may get much more.
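For a sense of where this matters, here is a hedged sketch of one known way to do vectorized carry propagation for a bignum add (requires AVX512-DQ for the 8-bit mask ops). The kshift/kadd/kxor chain sits squarely on the latency-critical path:

#include <immintrin.h>

// Add two vectors of 8 x 64-bit limbs and ripple the carries across lanes.
// The carry-out of the top lane is ignored in this sketch.
__m512i add_with_carries(__m512i a, __m512i b){
    const __m512i all_ones = _mm512_set1_epi64(-1);
    __m512i  sum   = _mm512_add_epi64(a, b);
    __mmask8 carry = _mm512_cmplt_epu64_mask(sum, a);         // lane produced a carry-out
    __mmask8 satur = _mm512_cmpeq_epi64_mask(sum, all_ones);  // lane would propagate a carry-in

    // Lanes that receive a carry: ((satur + (carry << 1)) ^ satur)
    __mmask8 carry_in = _kxor_mask8(
        _kadd_mask8(satur, _kshiftli_mask8(carry, 1)), satur);

    return _mm512_mask_sub_epi64(sum, carry_in, sum, all_ones);   // +1 where a carry arrived
}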
Difficulty to Implement:
This may actually be more difficult than it looks due to the design of the k registers. These mask registers are fast when accessed with 32-bit or 64-bit granular AVX512 instructions. This implies that each individual bit of the mask is reasonably close to its respective lane in the SIMD unit. But lane-crossing mask instructions (like kshift and kadd) have cross-lane dependencies. Given the size of the SIMD execution units and the distances between them, the mask bits would need to travel quite far on the die to "meet".
Adjacent Lane Permutes (ALPs):
Usefulness: 2
Difficulty to Implement: Easy
Let's call Adjacent Lane Permutes "ALPs" for short. This topic is big enough to write an entire blog post about. But until that happens, a short summary will have to do.
Given a lane size (32 bits, 64 bits, 128 bits, etc...), allow arbitrary permutations within adjacent lanes from two different input vectors. There are 4 types of adjacent lane permutes that are needed.
Input A: [A0, A1][A2, A3][...]
Input B: [B0, B1][B2, B3][...]
- Low: Grab the lower element in each pair of adjacent lanes from both inputs. Returns [A0, B0][A2, B2][...]
- High: Grab the upper element in each pair of adjacent lanes from both inputs. Returns [A1, B1][A3, B3][...]
- Blend: For each pair of lanes, grab the lower element from input A and the upper element from input B. Returns [A0, B1][A2, B3][...]
- Cross: For each pair of lanes, grab the upper element from input A and the lower element from input B. Returns [A1, B0][A3, B2][...]
Each lane size will need all 4 of these. On AVX512, there are 6 lane sizes going down to byte granularity: 256, 128, 64, 32, 16, and 8.
(Though I've personally never had a need to go below 32-bit granularity.)
vpunpcklqdq and vpunpckhqdq are examples of 64-bit ALPs. (low/high variants)
vperm2f128 and vperm2i128 are examples of 128-bit ALPs. (all 4 variants, but no 512-bit version exists)
vshufpd is a 64-bit ALP. (all 4 variants)
vpunpckldq and vpunpckhdq are not ALPs because they don't pull from adjacent 32-bit lanes.
vshuff64x2 and vshufi64x2 can do 256-bit ALPs (all 4 variants), but they can't do any of the 128-bit ALPs because they can't pull from adjacent lanes.
vinsertf64x4 and vinserti64x4 are both 256-bit ALPs. (low and blend variants)
In short, some of the ALPs already exist. But many are missing. The missing ones currently need to be emulated.
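For example, the 32-bit "Low" ALP has no dedicated instruction, but it can be emulated in one uop with a 2-input permute - at the cost of carrying around a permute-index constant, which is exactly the kind of overhead listed below. A hedged sketch:

#include <immintrin.h>

// 32-bit "Low" ALP: [A0, B0][A2, B2][...]. Indices 16-31 select from b.
__m512i alp32_low(__m512i a, __m512i b){
    const __m512i idx = _mm512_setr_epi32( 0, 16,  2, 18,  4, 20,  6, 22,
                                           8, 24, 10, 26, 12, 28, 14, 30);
    return _mm512_permutex2var_epi32(a, idx, b);   // single vpermt2d, but needs the constant
}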
Motivation:
Adjacent lane permutes are a set of building blocks for efficiently performing arbitrary dimension SIMD transposes such as 8x5, 16x13, etc... These are critical for efficiently performing AOS <-> SOA conversions which are subsequently critical for vectorizing a lot of code that is otherwise not vectorizable. Thus, ALPs allow odd-sized transposes to be done significantly faster than both gather/scatter and scalar code.
On Skylake X with AVX512-F, all ALPs of 32-bit granularity or larger can be done with a single 1-uop instruction. 16-bit and 8-bit granularities will require up to 2 or 3 uops respectively. Cannon Lake with AVX512-VBMI brings the 16-bit and 8-bit granularities down to 1 and 2 uops respectively.
But in all cases, there is a substantial amount of overhead that includes:
- Permute vectors.
- Mask constants.
- Reg-reg moves due to the destructiveness of 2-input permutes and embedded blend-merging.
- Type-casts due to some permutes existing in one type, but not the others. (Mostly a usage inconvenience in C/C++ with intrinsics.)
Mask register pressure also becomes a problem for smaller granularity transposes.
Difficulty to Implement:
The nature of the ALPs is that data locality increases with smaller granularities. So it is not necessary to have the expensive all-to-all permute hardware. It is possible to implement all of the ALPs with only O(N log(N)) transistors and routing traffic (where N is the bitlength of the SIMD vector).
Since AVX512 already has single-uop all-to-all permutes, it should be trivial to implement all the missing ALPs with single-uop instructions. The smaller granular ALPs aren't efficient this way, but since they have high locality, they are likely doable with minimal extra hardware as single-cycle latency instructions.
Overall, I rate the "usefulness" of native ALPs as only a 2 because:
- Data shuffling is generally not a performance bottleneck because it's usually possible to optimize it out of most critical compute kernels.
- All of the ALPs are either already directly supported or can be emulated in AVX512 with reasonable efficiency.
But since they are theoretically easy to implement in hardware, it's worth asking for them in some future processor.
Not everything needs to be a hardware improvement. Software improvements are possible too (and easier to implement).
Intrinsic for Complex Addressing:
Usefulness: 5
Difficulty to Implement: Very Easy
A common pattern in FFT-like workloads is strided memory access. For example, a radix-8 butterfly will have 8 or more memory streams:
- 8 streams if in-place.
- 16 streams if out-of-place.
- 24 streams if out-of-place + prefetching in the future.
Regardless of whether these streams are represented as pointers or offset indices, they all run into the same problem - register pressure. x64 simply does not have enough general purpose registers to adequately suit these types of access patterns.
There is one trick that solves this - a method I call "pointer folding". Say you want to access 8 streams. Thus you want to create the effect of:
__m512d* T0 = T + 0*stride;
__m512d* T1 = T + 1*stride;
__m512d* T2 = T + 2*stride;
__m512d* T3 = T + 3*stride;
__m512d* T4 = T + 4*stride;
__m512d* T5 = T + 5*stride;
__m512d* T6 = T + 6*stride;
__m512d* T7 = T + 7*stride;
for (...){
__m512d r0 = T0[0];
__m512d r1 = T1[0];
__m512d r2 = T2[0];
__m512d r3 = T3[0];
__m512d r4 = T4[0];
__m512d r5 = T5[0];
__m512d r6 = T6[0];
__m512d r7 = T7[0];
// Do work
T0 += 1;
T1 += 1;
T2 += 1;
T3 += 1;
T4 += 1;
T5 += 1;
T6 += 1;
T7 += 1;
}
But if you don't have 8 available registers to hold the pointers, you can fold them into 4 registers as follows:
char* T0 = (char*)(T + 0*stride);
char* T7 = (char*)(T + 7*stride);
size_t p = stride * sizeof(__m512d);
ptrdiff_t n = -(ptrdiff_t)p;   // negative byte stride
for (...){
__m512d r0 = _mm512_load_pd(T0);
__m512d r1 = _mm512_load_pd(T0 + 1*p);
__m512d r2 = _mm512_load_pd(T0 + 2*p);
__m512d r3 = _mm512_load_pd(T7 + 4*n);
__m512d r4 = _mm512_load_pd(T0 + 4*p);
__m512d r5 = _mm512_load_pd(T7 + 2*n);
__m512d r6 = _mm512_load_pd(T7 + 1*n);
__m512d r7 = _mm512_load_pd(T7);
// Do work
T0 += 64;
T7 += 64;
}
Subsequently, the start of the loop would ideally compile to something like this:
vmovapd zmm0, zmmword ptr [T0]
vmovapd zmm1, zmmword ptr [T0 + 1*p]
vmovapd zmm2, zmmword ptr [T0 + 2*p]
vmovapd zmm3, zmmword ptr [T7 + 4*n]
vmovapd zmm4, zmmword ptr [T0 + 4*p]
vmovapd zmm5, zmmword ptr [T7 + 2*n]
vmovapd zmm6, zmmword ptr [T7 + 1*n]
vmovapd zmm7, zmmword ptr [T7]

Thus it is now possible to do 16 or even 24 streams with only 16 general purpose registers. Great! Furthermore, this approach can be trivially extended to up to 12 streams of equal stride.
Well not so fast. The problem is that compilers won't actually let this happen. When a compiler sees a common subexpression such as (4*n), it will try to optimize it out. This leads to more "live values" and thus more registers which leads to crazy amounts of register spilling.
The only compiler I've seen that does a decent job of preserving the intent of the above code is the Intel Compiler Classic (ICC). But that compiler is deprecated and its replacement, Intel LLVM Compiler (ICX), is drastically inferior in this aspect (among other things).
Proposed Solution:
Add an intrinsic for SIB (scale-index-base) complex addressing. So maybe one day we can write it like this:
char* T0 = (char*)(T + 0*stride);
char* T7 = (char*)(T + 7*stride);
size_t p = stride * sizeof(__m512d);
ptrdiff_t n = -(ptrdiff_t)p;   // negative byte stride
for (...){
__m512d r0 = _mm512_load_pd(T0);
__m512d r1 = _mm512_load_pd(_sib(T0, p, 1, 0)); // T0 + 1*p + 0
__m512d r2 = _mm512_load_pd(_sib(T0, p, 2, 0)); // T0 + 2*p + 0
__m512d r3 = _mm512_load_pd(_sib(T7, n, 4, 0));
__m512d r4 = _mm512_load_pd(_sib(T0, p, 4, 0));
__m512d r5 = _mm512_load_pd(_sib(T7, n, 2, 0));
__m512d r6 = _mm512_load_pd(_sib(T7, n, 1, 0));
__m512d r7 = _mm512_load_pd(T7);
// Do work
T0 += 64;
T7 += 64;
}
Funny enough, the gather/scatter SIMD intrinsics already support this. We just need to extend the functionality to regular load/stores.
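For reference, a gather already expresses base + index*scale addressing - something like:

#include <immintrin.h>

// The gather intrinsics already take a base pointer, an index vector, and a scale.
__m512d gather_example(const double* base, __m512i vindex){
    return _mm512_i64gather_pd(vindex, base, 8);   // per lane: load from base + vindex[i]*8
}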
Nice Things that have Happened:
Sometimes, we do get nice things - but only sometimes...
Here are all the things I've wanted in the past that eventually came to life! This list omits the things that I only realized were useful after they had already been announced.
64-bit Arithmetic Right-Shift:
Usefulness: 3
Launched: XOP (2011), AVX512 (2016)
Instruction(s): vpshaq, vpsraq
The 64-bit arithmetic shift was one of the glaring omissions of the entire SSE line. AMD's XOP instruction set finally added it in 2011. But XOP is a dead-end instruction set. So this wasn't realistically added until AVX512. And we had to wait until Skylake X in 2017 to get it for real since Xeon Phi never went mainstream.
In most cases, this could be emulated with 3 instructions.
64-bit Unsigned Integer Compare:
Usefulness: 3
Launched: XOP (2011), AVX512 (2016)
Instruction(s): vpcomuq, vpcmpuq
Same as the 64-bit arithmetic shift in every single way. Also could be emulated with 3 instructions.
64-bit Integer <-> Double Conversion:
Usefulness: 2
Launched: AVX512-DQ (2017)
Instruction(s): vcvtqq2pd, vcvtuqq2pd, vcvtpd2uqq, vcvtpd2qq, vcvttpd2qq, vcvttpd2uqq
Yet another glaring omission from the SSE line. Fortunately most of these could be emulated with 2 instructions for the common use cases.
Variable Shifts:
Usefulness: 2
Launched: XOP (2011), AVX2 (2013)
Instruction(s): vpsh*, vpsllv*, vpsrlv*, vpsrav*
Another annoying omission from the SSE line. Fortunately, y-cruncher only ever needed the 64-bit versions which are the cheapest to emulate.