Internals - Integer Addition

By Alexander Yee

 

(Last updated: February 23, 2017)

 


 

Large integer addition and carryout have relatively little effect on performance in an environment dominated by super-linear operations like large integer multiplication.

Nevertheless, it is still necessary to deal with them. And there are a number of ways to do addition with varying levels of complexity and trade-offs.

 

 

Basic Addition

 

y-cruncher's internal representation of large integers is described here.

 

Adding and subtracting integers in that representation works the same way as it is taught in grade school. But instead of base 10, it's base 2^32 or 2^64. For each word, add the two operands together and carry the overflow into the next word. This is pretty standard so I won't go into the boring details. The interesting part is the implementation.

 

 

Language Level: C and C++

 

In C, C++, and most other C-like languages, there is no easy way to determine if adding two integers results in an overflow. For 32-bit words, you can upcast the operands to a 64-bit integer which will preserve the carryout bit. And for 64-bit words, you can break them into 32-bit words and do it twice.
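
For example, a minimal sketch of this approach for 32-bit words might look like the following (portable C++, not y-cruncher's actual code):

#include <cstdint>
#include <cstddef>

//  C = A + B over "len" 32-bit words. Widening to 64 bits keeps the carry visible.
uint32_t add_words32(uint32_t* C, const uint32_t* A, const uint32_t* B, size_t len){
    uint64_t carry = 0;
    for (size_t i = 0; i < len; i++){
        uint64_t sum = (uint64_t)A[i] + B[i] + carry;
        C[i]  = (uint32_t)sum;   //  low 32 bits are the result word
        carry = sum >> 32;       //  bit 32 is the carry into the next word
    }
    return (uint32_t)carry;      //  final carry-out
}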

 

The problem with this "half-word" approach is that it ends up wasting half the hardware adder logic. Unfortunately, there are no ways around this in standard C/C++.

 

 

Hardware Specific: x86 and x64

 

On x86 processors, things are much better since there is hardware support for this. The add instruction sets the carry-bit flag when the sum overflows.

The adc instruction (add-with-carry) does the same thing, except that it also adds in the carry-bit flag to the result.

 

So adding a pair of large integers in the standard full-word representation is as simple as chaining the adc instruction.

add     rax, rbx;

adc     rax, rcx;

adc     rax, rdx;

adc     rax, rdi;

Actual code will involve streaming data from memory. An out-of-place add will look something like this:

mov     rax, QWORD PTR [r8];

add     rax, QWORD PTR [r9];

mov     QWORD PTR [r10], rax;

 

mov     rax, QWORD PTR [r8 + 8];

adc     rax, QWORD PTR [r9 + 8];

mov     QWORD PTR [r10 + 8], rax;

 

mov     rax, QWORD PTR [r8 + 16];

adc     rax, QWORD PTR [r9 + 16];

mov     QWORD PTR [r10 + 16], rax;

Broadwell processors support the ADX instruction set which adds the adcx and adox instructions. These use different flag bits which allow for 2 independent carry-chains to happen simultaneously. While this is less important for addition, it's a big deal for basecase multiplication.

 

Until recently (~2015-ish), there was no way to access the adc instruction when programming in C/C++ aside from inline assembly. But most major compilers now support intrinsics that will generate it. Though it's worth noting that compilers nowadays vary greatly in their ability to optimize add-with-carry intrinsics.

Because no compiler to date is able to generate quality code for the adcx and adox instructions, y-cruncher (as of v0.7.1) still relies heavily on inline assembly.
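
For reference, a minimal sketch of the intrinsic route (not y-cruncher's code) using _addcarry_u64() from <immintrin.h> on GCC/Clang/ICC. Ideally the body of the loop compiles down to a single adc, though as noted above, how well that actually happens varies by compiler:

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

//  C = A + B over "len" 64-bit words using the add-with-carry intrinsic.
uint64_t add_words64(uint64_t* C, const uint64_t* A, const uint64_t* B, size_t len){
    unsigned char carry = 0;
    for (size_t i = 0; i < len; i++){
        unsigned long long sum;
        //  Consumes the previous carry-out and produces the next one.
        carry = _addcarry_u64(carry, A[i], B[i], &sum);
        C[i] = sum;
    }
    return carry;
}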

 

 

Parallel Addition

 

Since bignum addition is computationally cheap, it is generally memory-bound. So y-cruncher has never parallelized bignum addition since it would give little to no benefit.

 

However, things are changing now. On modern massively multi-core hardware, it is impossible for a single thread to utilize all the memory bandwidth. Combine that with Amdahl's Law and we see that sequential bignum addition is starting to become a liability.

 

As of v0.7.1, y-cruncher does not do parallel addition. But it will in some future version.

 

 

The Algorithm:

 

The carry-chain in bignum addition has one big problem: It's a dependency chain. And that's a bad thing when attempting to parallelize.

 

Each word depends on the carryout from the previous word, so it must wait until the previous add is done. That makes addition an inherently sequential task with no parallelism to exploit at all. Or is there?

 

Fortunately, there is a way to break the chain. Addition is associative: you can change the order in which the additions are done, which means it's possible to ignore a carry-in until later. So the idea is to break the operands into smaller blocks, add the blocks in parallel, then sort out the carries afterwards.

 

Consider the following example in base 10:

A = 4453154504161422340178736208899939126959165670131031842194475823

B = 9720295491078215560285912369089892390215253790336986963019504972

Break it up into blocks of 16 digits each:

A = 4453154504161422 3401787362088999 3912695916567013 1031842194475823

B = 9720295491078215 5602859123690898 9239021525379033 6986963019504972

Add each block independently (in parallel):

C = (1)4173449995239637 9004646485779897 (1)3151717441946046 8018805213980795

Deal with the carryout:

C = 1 4173449995239637 9004646485779897+1 3151717441946046 8018805213980795

Combine to produce final sum:

C = 14173449995239637900464648577989831517174419460468018805213980795
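
Here is a hedged sketch of this block-splitting scheme in C++ with 64-bit words (the function names are illustrative, not y-cruncher's). Phase 1 adds the blocks independently and records each block's carry-out. Phase 2 then resolves the inter-block carries sequentially:

#include <cstdint>
#include <cstddef>
#include <vector>

//  Add one block of "len" words, returning its carry-out (0 or 1).
static uint64_t add_block(uint64_t* C, const uint64_t* A, const uint64_t* B, size_t len){
    uint64_t carry = 0;
    for (size_t i = 0; i < len; i++){
        uint64_t s = A[i] + carry;
        carry  = (s < carry);
        C[i]   = s + B[i];
        carry |= (C[i] < s);
    }
    return carry;
}

void parallel_add(uint64_t* C, const uint64_t* A, const uint64_t* B, size_t len, size_t block){
    size_t blocks = (len + block - 1) / block;
    std::vector<uint64_t> carry(blocks);

    //  Phase 1: add each block independently. This is the parallelizable part
    //  (OpenMP shown for brevity; the pragma is ignored if OpenMP is disabled).
#pragma omp parallel for
    for (ptrdiff_t b = 0; b < (ptrdiff_t)blocks; b++){
        size_t lo = b * block;
        size_t hi = lo + block < len ? lo + block : len;
        carry[b] = add_block(C + lo, A + lo, B + lo, hi - lo);
    }

    //  Phase 2: ripple the inter-block carries. Sequential, but normally touches
    //  only a few words. In the pathological case it degrades to O(N).
    for (size_t b = 0; b + 1 < blocks; b++){
        if (carry[b]){
            size_t lo = (b + 1) * block;
            size_t hi = lo + block < len ? lo + block : len;
            uint64_t c = 1;
            for (size_t i = lo; i < hi && c; i++){
                C[i] += 1;
                c = (C[i] == 0);
            }
            carry[b + 1] |= c;  //  the +1 rippled out of this block too
        }
    }
    //  The carry-out of the whole number is carry[blocks - 1].
}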

 

Pathological Carryout:

 

On average, the expensive part is the addition of the independent blocks. This is good since it can be done in parallel. The carry resolution that follows is computationally negligible.

So for an N-digit number and p processors, the running time is O(N/p) - which is perfect.

 

That is the best case, which will also be the average case for random inputs. But unfortunately, it is possible for the carryout to propagate through the entire number as in this example:

A = 9999999999999999999999999999999999999999999999999999999999999999

B = 0000000000000000000000000000000000000000000000000000000000000001

 

A = 9999999999999999 9999999999999999 9999999999999999 9999999999999999

B = 0000000000000000 0000000000000000 0000000000000000 0000000000000001

 

C = 9999999999999999 9999999999999999 9999999999999999 (1)0000000000000000

C = 10000000000000000000000000000000000000000000000000000000000000000

Since the carryout is sequential and spans all N digits, the worst-case running time is O(N) - which is the same as not parallelizing at all. And it is actually worse since you've now made two passes over the dataset.

 

As of v0.7.1, y-cruncher does not parallelize addition. But pathological carryout is still relevant since all the FFT-based large multiplication algorithms do some sort of parallel carryout which will run into the same pathological worst-cases. For computations of Pi where the data is more or less random, pathological carryout simply does not happen. So while y-cruncher will handle them correctly, it will only do so with O(N) time complexity.

 

It's worth noting that Newton's Method iterations (for division and square root) are more prone to pathological carryout even for random inputs. This is because one of the multiplications will lead to a number of the form "0.9999xxxxxxxx" or "1.0000xxxxxxxx". But y-cruncher's implementation doesn't have this problem since the string of consecutive 9's or 0's is eliminated by the middle-product optimization. For more details, see the page on division and square root.

 

 

N x 1 Multiplication:

 

The parallel addition algorithm above can be trivially extended to parallelize a multiplication of a large integer and a machine word. The same problems with pathological inputs also apply.
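
For reference, a minimal sketch of the (sequential) N x 1 multiply, assuming 64-bit words and a compiler with the unsigned __int128 extension (GCC/Clang). The carry word threaded through each iteration is exactly what the block-splitting trick above would defer:

#include <cstdint>
#include <cstddef>

//  C = A * b, where A has "len" 64-bit words. Returns the final carry-out word.
uint64_t mul_N_by_1(uint64_t* C, const uint64_t* A, size_t len, uint64_t b){
    uint64_t carry = 0;
    for (size_t i = 0; i < len; i++){
        unsigned __int128 p = (unsigned __int128)A[i] * b + carry;
        C[i]  = (uint64_t)p;          //  low 64 bits of the product
        carry = (uint64_t)(p >> 64);  //  high 64 bits feed into the next word
    }
    return carry;
}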

 

Again y-cruncher currently does not parallelize N x 1 multiplication for the same reasons as addition. But it will at some point in the future.

 

 

 

Kogge-Stone Parallel Addition

 

In order to achieve a sub-linear worst case, we need a way to handle all inputs efficiently. And for that, we borrow a page from hardware academia.

 

The Kogge-Stone Adder is a parallel prefix adder that's commonly implemented in hardware. The algorithm can perform an addition of two N-bit integers with a guaranteed latency of O(log(N)). And it does it for all inputs including pathological ones.

 

I won't go into the details of the Kogge-Stone Adder since that's what Wikipedia is for. But despite it being a hardware algorithm, it can be used at a high level to guarantee O(N/p) running time for all inputs.

 

The original Kogge-Stone Adder is a recursive radix-2 algorithm with a depth of lg(N). We won't be using that. Instead, a two-level algorithm with radices of N/p and p will suffice for the purpose of y-cruncher. As of this writing, y-cruncher isn't able to parallelize to the point where p becomes excessively large.

 

The best way to explain the approach is by example:

A = 4453154504161422340178736208899939126959165670131031842194475823

B = 9720295495838578560285913791100060873041253790338968157819504972

Break it up into p blocks:

A = 44531545 04161422 34017873 62088999 39126959 16567013 10318421 94475823

B = 97202954 95838578 56028591 37911000 60873041 25379033 89681578 19504972

Compute P[i] = A[i] + B[i] ignoring overflow: (can be done in parallel)

P = 141734499 00000000 90046464 99999999 00000000 41946046 99999999 13980795

Compute bitmasks C[i] = true if A[i] + B[i] overflows, and M[i] = true if P[i] is the maximum block value (all 9's): (can be done in parallel)

C = 0 1 0 0 1 0 0 1

M = 0 0 0 1 0 0 1 0

At this point, each block in P is either correct or can be made correct by adding 1 (and ignoring overflow). The final step is to figure out which blocks need that extra 1. Fortunately, this can be done solely by examining the masks C and M. So we've effectively reduced the problem from N bits of data to p bits of data.

 

The remaining steps are as follows:

  1. Shift C up by 1 bit. This is because the carry goes to the next block.
  2. Iterate the bits of C and M from the bottom up:
    1. Index 0:
      • If C[0] == 0, then P[0] is correct and we move on.
      • If C[0] == 1 and M[0] == 0, then P[0] is wrong and we need to add 1.
      • If C[0] == 1 and M[0] == 1, then P[0] is wrong and we need to add 1. Furthermore we need to set carry to the next bit.
    2. Index 1:
      1. If C[1] == 0 and M[1] == 0 and carry == 0, then P[1] is correct and we move on.
      2. If C[1] == 0 and M[1] == 0 and carry == 1, then P[1] is wrong and we need to add 1.
      3. If C[1] == 0 and M[1] == 1 and carry == 0, then P[1] is correct and we move on.
      4. If C[1] == 0 and M[1] == 1 and carry == 1, then P[1] is wrong and we need to add 1. Furthermore we need to set carry to the next bit.
      5. If C[1] == 1 and M[1] == 0 and carry == 0, then P[1] is wrong and we need to add 1.
      6. If C[1] == 1 and M[1] == 0 and carry == 1, this case is mathematically impossible.
      7. If C[1] == 1 and M[1] == 1 and carry == 0, then P[1] is wrong and we need to add 1. Furthermore we need to set carry to the next bit.
      8. If C[1] == 1 and M[1] == 1 and carry == 1, this case is mathematically impossible.
    3. Repeat for all the indices.

Finishing Up:

C  = 0 1 0 0 1 0 0 1

C' = 1 0 0 1 0 0 1 0    (C shifted left by 1)

M  = 0 0 0 1 0 0 1 0

 

Blocks to Correct: 1 0 1 1 0 1 1 0

 

P = 141734499 00000000 90046464 99999999 00000000 41946046 99999999 13980795

S = 141734500 00000000 90046465 00000000 00000000 41946047 00000000 13980795

 

A + B = 14173450000000000900464650000000000000000419460470000000013980795

 

When we need to add 1 to a block, it has the potential to carry through the entire block taking up O(N/p) time. But each block is independent so they can all be done in parallel. The iteration to determine which blocks to add 1 is not parallelizable due to the carry logic. But it covers only p bits of data - which is negligible compared to the original N bits. So in the end the complexity of adding two N-bit integers with p processors remains O(N/p) in time.

 

If you look closer at the iteration over the C and M bits, you'll notice that each step is equivalent to a full-adder. In other words, the entire iteration is just an integer addition of p-bit integers. This makes the algorithm recursive: if the p-bit addition is itself too large, it can be done using Kogge-Stone again.
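
As an illustration, here is a hedged scalar sketch of that p-bit addition (the function name is made up, and p is assumed to be at most 63 so everything fits in one 64-bit word). With C and M packed into integers and bit 0 holding the least-significant block, the entire case analysis collapses into a shift, an add, and an XOR, which is the same sequence the AVX512 routines below perform in the mask registers.

#include <cstdint>

//  C = per-block overflow mask, M = per-block "sum is the max block" mask,
//  bit 0 = least-significant block. carry holds the carry-in (0 or 1) and is
//  replaced with the carry-out. Returns the mask of blocks that need +1.
uint64_t blocks_to_correct(uint64_t C, uint64_t M, uint64_t& carry, int p){
    uint64_t g = (C << 1) | carry;        //  carries generated into each block
    uint64_t t = g + M;                   //  the add ripples them through runs of "max" blocks
    carry      = t >> p;                  //  carry-out of the entire number
    return (t ^ M) & ((1ull << p) - 1);   //  blocks that receive a carry
}

Plugging in the masks from the example above (C = 0b01001001, M = 0b00010010, carry-in 0, p = 8) reproduces the "Blocks to Correct" mask 10110110.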

 

 

Kogge-Stone Vector Addition

 

Background:

 

One of the many y-cruncher side-projects was to see if the Kogge-Stone adder can be done efficiently at a low level using SIMD. Guess what? The answer is yes.

In other words, it is possible to add together a vector type (like __m512i) as if it were a 512-bit integer.

 

At first, this may seem like a significant breakthrough in bignum arithmetic. But in reality, it's not really that useful:

  1. Applications that target small and medium-sized numbers benefit more from partial word representations which are more efficient to vectorize. This is going to be a big deal with Cannonlake processors thanks to the AVX512-IFMA instruction set.

  2. Applications that target large numbers spend all their time doing large multiplications. For example, y-cruncher spends less than 0.5% of the CPU time doing bignum addition for a typical computation.

  3. The approach presented here has enough overhead that it needs AVX512 to outperform a chain of add-with-carry instructions on x64. This limits its usability.

So the most viable place to use vectorized addition is in legacy code that is "stuck" with full-word representations.

 

Another possibility for the vectorized add is the Schönhage-Strassen Algorithm (SSA) which consists largely of additions and subtractions. But the benefits of that are quite limited since the performance of SSA has fallen so far behind other (vectorizable) algorithms that a Kogge-Stone add is unlikely to save it.

 

 

AVX512-DQ:

 

Below is C++ intrinsic code for a 512-bit full-adder. The carry parameter indicates the carry-in and is replaced with the carry-out. It must be either 0 or 1.

__m512i add512(__mmask16& carry, __m512i A, __m512i B){

    const __m512i MAX_WORD = _mm512_set1_epi64(0xffffffffffffffff);

 

    __m512i s = _mm512_add_epi64(A, B);                     //  lane-wise sum (per-lane carries are lost)

    __mmask16 c = _mm512_cmplt_epu64_mask(s, A);            //  lanes that overflowed ("generate")

    __mmask16 m = _mm512_cmpeq_epi64_mask(s, MAX_WORD);     //  lanes equal to the max word ("propagate")

 

    {

        c <<= 1;

        c = _mm512_kor(c, carry);

        c += m;

        carry = c >> 8;

        m = _mm512_kxor(m, c);

    }

 

    return _mm512_mask_sub_epi64(s, m, s, MAX_WORD);        //  subtracting 0xFFF...F adds 1 to the flagged lanes

}

For maximum efficiency, all the mask operations in the middle brackets should be done entirely in the AVX512 mask registers. Unfortunately, both GCC and the Intel Compiler have trouble doing that and have a tendency to move them into the general-purpose registers.

 

The only AVX512-DQ instruction here is the c += m, which ideally compiles to a kaddw instruction. The kadd instructions seemed weird when first introduced since adding masks didn't appear to make any sense. But apparently, this is a legitimate use-case.

 

This routine has 9 instructions. When streaming through memory, the load/stores make it 11 instructions. When chaining these, the critical path is only 3 of the mask instructions in the middle. So if we assume that mask instructions have single-cycle latency and at least 2/cycle throughput on Skylake Purley, this routine can achieve a best case of 3 cycles/512 bits. By comparison, the classic ADC-chaining implementation can do no better than 8 cycles/512 bits.

 

If you don't care about carry-in and carry-out, two instructions can be removed which reduces it to 7 if you're working off registers.

 

It remains to be seen if this approach will actually beat ADC-chaining. My personal guess is yes. And it's worth mentioning that this method has plenty of room to scale in the future should AVX512 grow to even wider vectors.

 

 

AVX512-F:

 

A small tweak to the AVX512-DQ approach will make it work on AVX512-F for Knights Landing with a couple extra instructions.

__m512i add512(uint32_t& carry, __m512i A, __m512i B){

    const __m512i MAX_WORD = _mm512_set1_epi64(0xffffffffffffffff);

 

    __m512i s = _mm512_add_epi64(A, B);

    __mmask16 c = _mm512_cmplt_epu64_mask(s, A);

    __mmask16 m = _mm512_cmpeq_epi64_mask(s, MAX_WORD);

 

    {

        uint32_t c0 = _mm512_mask2int(c);

        uint32_t m0 = _mm512_mask2int(m);

        carry += m0;

        carry = (carry + c0*2); //  lea

        m0 ^= carry;

        carry >>= 8;

        m = _mm512_int2mask(m0);

    }

 

    return _mm512_mask_sub_epi64(s, m, s, MAX_WORD);

}

Assuming Knights Landing, this replaces all the 2-cycle mask instructions (and the non-existent kaddw) with 1-cycle GPR instructions and allows the shift-or to be merged into a lea. But this adds 3 high-latency kmovw instructions. The critical path remains 3 cycles, but Knights Landing doesn't have the execution width to do better than 5 cycles/512 bits.

 

Unlike the AVX512-DQ version, this one has no trouble compiling efficiently on both ICC and GCC. In fact, this one may even be faster if the AVX512-DQ version gets compiled poorly.

 

 

 

AVX2:

 

Just for kicks, here's an AVX2 implementation. The lack of unsigned compare and mask-to-vector instructions significantly complicates things:

const __m256i BROADCAST_MASK[16] = {

    _mm256_set_epi64x(0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000),

    _mm256_set_epi64x(0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000001),

    _mm256_set_epi64x(0x8000000000000000, 0x8000000000000000, 0x8000000000000001, 0x8000000000000000),

    _mm256_set_epi64x(0x8000000000000000, 0x8000000000000000, 0x8000000000000001, 0x8000000000000001),

    _mm256_set_epi64x(0x8000000000000000, 0x8000000000000001, 0x8000000000000000, 0x8000000000000000),

    _mm256_set_epi64x(0x8000000000000000, 0x8000000000000001, 0x8000000000000000, 0x8000000000000001),

    _mm256_set_epi64x(0x8000000000000000, 0x8000000000000001, 0x8000000000000001, 0x8000000000000000),

    _mm256_set_epi64x(0x8000000000000000, 0x8000000000000001, 0x8000000000000001, 0x8000000000000001),

    _mm256_set_epi64x(0x8000000000000001, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000),

    _mm256_set_epi64x(0x8000000000000001, 0x8000000000000000, 0x8000000000000000, 0x8000000000000001),

    _mm256_set_epi64x(0x8000000000000001, 0x8000000000000000, 0x8000000000000001, 0x8000000000000000),

    _mm256_set_epi64x(0x8000000000000001, 0x8000000000000000, 0x8000000000000001, 0x8000000000000001),

    _mm256_set_epi64x(0x8000000000000001, 0x8000000000000001, 0x8000000000000000, 0x8000000000000000),

    _mm256_set_epi64x(0x8000000000000001, 0x8000000000000001, 0x8000000000000000, 0x8000000000000001),

    _mm256_set_epi64x(0x8000000000000001, 0x8000000000000001, 0x8000000000000001, 0x8000000000000000),

    _mm256_set_epi64x(0x8000000000000001, 0x8000000000000001, 0x8000000000000001, 0x8000000000000001),

};

 

__m256i add256(uint32_t& carry, __m256i A, __m256i B){

    A = _mm256_xor_si256(A, _mm256_set1_epi64x(0x8000000000000000));

 

    __m256i s = _mm256_add_epi64(A, B);

    __m256i cv = _mm256_cmpgt_epi64(A, s);

    __m256i mv = _mm256_cmpeq_epi64(s, _mm256_set1_epi64x(0x7fffffffffffffff));

    uint32_t c = _mm256_movemask_pd(_mm256_castsi256_pd(cv));

    uint32_t m = _mm256_movemask_pd(_mm256_castsi256_pd(mv));

 

    {

        c = m + 2*c; //  lea

        carry += c;

        m ^= carry;

        carry >>= 4;

        m &= 0x0f;

    }

 

    return _mm256_add_epi64(s, BROADCAST_MASK[m]);

}

This monstrosity compiles to 13 instructions (14 when streaming through memory). The initial XOR and the tweaking of the constants are work-arounds for the lack of an unsigned integer compare. The lookup table is a work-around for the lack of a mask-to-vector instruction.

 

You'd think that this approach has absolutely no chance of beating a chain of 4 adc instructions. But in a loop streaming through memory (fitting in L1 cache), this method using just AVX2 is only about 10% slower on both Haswell and Skylake!

 

This is why I think the AVX512-DQ version will win on Skylake Purley. If this thing comes close to beating a chain of 4 adc instructions, there's a good chance the AVX512-DQ version (which is more efficient) will beat a chain of 8 of them.

 

 


For the sake of completeness, here are the full-subtractors. They are the same except for a few sign-flips and tweaks of the constants.

Again, the carry-in must be either 0 or 1, where 1 indicates a borrow.

 

 

AVX512-DQ:

__m512i sub512(__mmask16& carry, __m512i A, __m512i B){

    const __m512i MAX_WORD = _mm512_set1_epi64(0xffffffffffffffff);

 

    __m512i s = _mm512_sub_epi64(A, B);

    __mmask16 c = _mm512_cmpgt_epu64_mask(s, A);

    __mmask16 m = _mm512_cmpeq_epi64_mask(s, _mm512_setzero_si512());

 

    {

        c <<= 1;

        c = _mm512_kor(c, carry);

        c += m;

        carry = c >> 8;

        m = _mm512_kxor(m, c);

    }

 

    return _mm512_mask_add_epi64(s, m, s, MAX_WORD);

}

 

AVX512-F:

__m512i sub512(uint32_t& carry, __m512i A, __m512i B){

    const __m512i MAX_WORD = _mm512_set1_epi64(0xffffffffffffffff);

 

    __m512i s = _mm512_sub_epi64(A, B);

    __mmask16 c = _mm512_cmpgt_epu64_mask(s, A);

    __mmask16 m = _mm512_cmpeq_epi64_mask(s, _mm512_setzero_si512());

 

    {

        uint32_t c0 = _mm512_mask2int(c);

        uint32_t m0 = _mm512_mask2int(m);

        carry += m0;

        carry = (carry + c0*2); //  lea

        m0 ^= carry;

        carry >>= 8;

        m = _mm512_int2mask(m0);

    }

 

    return _mm512_mask_add_epi64(s, m, s, MAX_WORD);

}

 

AVX2:

 

Using the same lookup table as the full-adder...

__m256i sub256(uint32_t& carry, __m256i A, __m256i B){

    A = _mm256_xor_si256(A, _mm256_set1_epi64x(0x8000000000000000));

 

    __m256i s = _mm256_sub_epi64(A, B);

    __m256i cv = _mm256_cmpgt_epi64(s, A);

    __m256i mv = _mm256_cmpeq_epi64(s, _mm256_set1_epi64x(0x8000000000000000));

    uint32_t c = _mm256_movemask_pd(_mm256_castsi256_pd(cv));

    uint32_t m = _mm256_movemask_pd(_mm256_castsi256_pd(mv));

 

    {

        c = m + 2*c; //  lea

        carry += c;

        m ^= carry;

        carry >>= 4;

        m &= 0x0f;

    }

 

    return _mm256_sub_epi64(s, BROADCAST_MASK[m]);

}