(Last updated: October 22, 2018)

**Back To:**

Large integer addition and carryout have relatively little effect on performance in an environment dominated by super-linear operations like large integer multiplication.

But nevertheless, it is still necessary to deal with them. And there are a number of ways to do addition with varying levels of complexity and trade-offs.

y-cruncher's internal representation of large integers is described here.

Adding and subtracting integers in that representation is the same as what is taught in grade-school. But instead of base 10, it's base 2^{32} or 2^{64}. For each word, add them together and carry the overflow to the next word. This is pretty standard so I won't go into the boring details. The interesting parts are the implementation.

**Language Level: C and C++**

In C, C++, and most other C-like languages, there is no easy way to determine if adding two integers results in an overflow. For 32-bit words, you can upcast the operands to a 64-bit integer which will preserve the carryout bit. And for 64-bit words, you can break them into 32-bit words and do it twice.

The problem with this "half-word" approach is that it ends up wasting half the hardware adder logic. Unfortunately, there are no ways around this in standard C/C++.

**Hardware Specific: x86 and x64**

On x86 processors, things are much better since there is hardware support for this. The add instruction sets the carry-bit flag when the sum overflows.

The adc instruction (add-with-carry) does the same thing, except that it also adds in the carry-bit flag to the result.

So adding a pair of large integers in the standard full-word representation is as simple as chaining the adc instruction.

add rax, rbx;

adc rax, rcx;

adc rax, rdx;

adc rax, rdi;

Actual code will involve streaming data from memory. An out-of-place add will look something like this:

mov rax, QWORD PTR [r8];

add rax, QWORD PTR [r9];

mov QWORD PTR [r10], rax;

mov rax, QWORD PTR [r8 + 8];

adc rax, QWORD PTR [r9 + 8];

mov QWORD PTR [r10 + 8], rax;

mov rax, QWORD PTR [r8 + 16];

adc rax, QWORD PTR [r9 + 16];

mov QWORD PTR [r10 + 16], rax;

Broadwell processors support the ADX instruction set which adds the adcx and adox instructions. These use different flag bits which allow for 2 independent carry-chains to happen simultaneously. While this is less important for addition, it's a big deal for basecase multiplication.

Until recently (~2015-ish), there was no way access the adc instruction when programming in C/C++ aside from inline assembly. But most major compilers now support intrinsics that will generate them. Though it's worth noting that compilers nowadays vary greatly in their ability to optimize add-with-carry intrinsics.

- The Intel Compiler generates quality code for adc, but is not so great with the adcx and adox instructions.
- The Microsoft compiler also generates quality code for adc. But it never generates the adox instruction. So it's impossible to get two efficient carry-chains.
- GCC generates the terrible code for all the add-with-carry intrinsics. And there was a bug that prevented their use which wasn't fixed until GCC 5.3.

Because no compiler to date is able to generate quality code for the adcx and adox instructions, y-cruncher (as of v0.7.1) still relies heavily on inline assembly.

Since bignum addition is computationally cheap, it is generally memory-bound. So historically speaking, parallelizing bignum addition is not worthwhile.

However, things are changing now. On modern massively multi-core hardware, it is impossible for a single thread to utilize all the memory bandwidth. Combine that with Amdahl's Law and we see that sequential bignum addition is starting to become a liability.

**The Algorithm:**

The carry-chain in bignum addition has one big problem: It's a dependency chain. And that's a bad thing when attempting to parallelize.

Each word depends on the carryout from the previous word. So it must wait until the previous add is done. So addition is inherently a sequential task. And there is no parallelism that can be exploited at all. Or is there?

Fortunately, there is a way to break the chain. Addition is associative, you can change the order you do them. Which means it's possible to ignore a carry-in until later. So the idea is to break the operands into smaller blocks, add them in parallel, then sort out the carryout.

Consider the following example in base 10:

A = 4453154504161422340178736208899939126959165670131031842194475823

B = 9720295491078215560285912369089892390215253790336986963019504972

Break it up into blocks of 16 digits each:

A = 4453154504161422 3401787362088999 3912695916567013 1031842194475823

B = 9720295491078215 5602859123690898 9239021525379033 6986963019504972

Add each block independently (in parallel):

C = (1)4173449995239637 9004646485779897 (1)3151717441946046 8018805213980795

Deal with the carryout:

C = 1 4173449995239637 9004646485779897+1 3151717441946046 8018805213980795

Combine to produce final sum:

C = 14173449995239637900464648577989831517174419460468018805213980795

**Pathological Carryout:**

On average, the expensive part is the addition of the independent blocks. This is good since it can be done in parallel. The following carryout is computationally negligible.

So for an *N*-digit number and *p* processors, the running time is *O(N/p)* - which is perfect.

That is the best case, which will also be the average case for random inputs. But unfortunately, it is possible for the carryout to propagate through the entire number as in this example:

A = 9999999999999999999999999999999999999999999999999999999999999999

B = 0000000000000000000000000000000000000000000000000000000000000001

A = 9999999999999999 9999999999999999 9999999999999999 9999999999999999

B = 0000000000000000 0000000000000000 0000000000000000 0000000000000001

C = 9999999999999999 9999999999999999 9999999999999999 (1)0000000000000000

C = 10000000000000000000000000000000000000000000000000000000000000000

Since the carryout is sequential and spans all *N* digits, the worst-case running time is *O(N)* - which is the same as not parallelizing at all. And it is actually worse since you've now made *two* passes over the dataset.

While pathological carryout is theoretically rare, it *can* still happen even when the data involved is seemly random.

An example of this are in Newton's Method iterations. In some Newton's Method iterations, one of the multiplications will lead to a number of the form, "0.9999xxxxxxxx" or "1.0000xxxxxxxx". Subsequent additions or subtractions on this number may lead to pathological carryout.

**N x 1 Multiplication:**

The parallel addition algorithm above can be trivially extended to parallelize a multiplication of a large integer and a machine word. The same problems with pathological inputs also apply.

In order to achieve sub-linear worst-case, we need a way to efficiently handle all inputs. And for that, we borrow a page from hardware academia.

The Kogge-Stone Adder is a parallel prefix adder that's commonly implemented in hardware. The algorithm can perform an addition of two N-bit integers with a guaranteed latency of *O(log(N))*. And it does it for all inputs including pathological ones.

I won't go into the details of the Kogge-Stone Adder since that's what Wikipedia is for. But despite it being a hardware algorithm, it can used at a high-level to guarantee *O(N/p)* running time for all inputs.

The original Kogge-Stone Adder is a recursive radix-2 algorithm with a depth of *lg(N)*. We won't be using that. Instead, a two-level algorithm with radices of *N/p* and *p* will suffice for the purpose of y-cruncher. As of this writing, y-cruncher isn't able to parallelize to the point where *p* becomes excessively large.

The best way to explain the approach is by example:

A = 4453154504161422340178736208899939126959165670131031842194475823

B = 9720295495838578560285913791100060873041253790338968157819504972

Break it up into *p* blocks:

A = 44531545 04161422 34017873 62088999 39126959 16567013 10318421 94475823

B = 97202954 95838578 56028591 37911000 60873041 25379033 89681578 19504972

Compute *P[i] = A[i] + B[i]* ignoring overflow: (can be done in parallel)

P = 141734499 00000000 90046464 99999999 00000000 41946046 99999999 13980795

Compute bitmasks *C[i] = *true if *A[i] + B[i]* overflows, and *M[i] = *true if *A[i] + B[i]* is max block: (can be done in parallel)

C = 0 1 0 0 1 0 0 1

M = 0 0 0 1 0 0 1 0

At this point:

- C indicates which blocks generated a carry.
- M indicates which blocks will generate a carry if the carry-in to them is 1.
- P is very close to the final sum.

In fact each block in *P* is either correct or can be made correct by adding 1 (and ignoring overflow). The final step is to figure out which blocks we need to add 1 to. Fortunately, this can be done solely by examining the masks C and M. So we've effectively reduced the problem from *N* bits of data to *p* bits of data.

The remaining steps are as follows:

- Shift C up by 1 bit. This is because the carry goes to the next block.
- Iterate the bits of C and M from the bottom up:
- Index 0:
- If C[0] == 0, then P[0] is correct and we move on.
- If C[0] == 1 and M[1] == 0, then P[0] is wrong and we need to add 1.
- If C[0] == 1 and M[1] == 1, then P[0] is wrong as we need to add 1. Furthermore we need to set carry to the next bit.
- Index 1:
- If C[1] == 0 and M[1] == 0 and carry == 0, then P[1] is correct and we move on.
- If C[1] == 0 and M[1] == 0 and carry == 1, then P[1] is wrong and we need to add 1.
- If C[1] == 0 and M[1] == 1 and carry == 0, then P[1] is correct and we move on.
- If C[1] == 0 and M[1] == 1 and carry == 1, then P[1] is wrong and we need to add 1. Furthermore we need to set carry to the next bit.
- If C[1] == 1 and M[1] == 0 and carry == 0, then P[1] is wrong and we need to add 1.
- If C[1] == 1 and M[1] == 0 and carry == 1, this case is mathematically impossible.
- If C[1] == 1 and M[1] == 1 and carry == 0, then P[1] is wrong and we need to add 1. Furthermore we need to set carry to the next bit.
- If C[1] == 1 and M[1] == 1 and carry == 1, this case is mathematically impossible.

- Repeat for all the indices.

Finishing Up:

C = 0 1 0 0 1 0 0 1

C' = 1 0 0 1 0 0 1 0 (C shifted left by 1)

M = 0 0 0 1 0 0 1 0

Blocks to Correct: 1 0 1 1 0 1 1 0

P = 141734499 00000000 90046464 99999999 00000000 41946046 99999999 13980795

S = 141734500 00000000 90046465 00000000 00000000 41946047 00000000 13980795

A + B = 14173450000000000900464650000000000000000419460470000000013980795

When we need to add 1 to a block, it has the potential to carry through the entire block taking up *O(N/p)* time. But each block is independent so they can all be done in parallel. The iteration to determine which blocks to add 1 is not parallelizable due to the carry logic. But it covers only *p* bits of data - which is negligible compared to the original *N* bits. So in the end the complexity of adding two *N*-bit integers with *p* processors remains *O(N/p)* in time.

If you look closer at the iteration over the C and M bits, you'll notice that each step is equivalent to a full-adder. In other words, the entire iteration is just an integer addition of *p*-bit integers. Therefore, it is a recursive algorithm. So if the *p*-bit addition is too large, it can be done using Kogge-Stone again.

**Background:**

One of the many y-cruncher side-projects was to see if the Kogge-Stone adder can be done efficiently at a low level using SIMD. Guess what? The answer is yes.

In other words, it is possible to add together a vector type (like __m512i) as if it were a 512-bit integer.

At first, this may seem like a significant breakthrough in bignum arithmetic. But in reality, it's not really that useful:

- Applications that target small and medium-sized numbers benefit more from partial word representations which are more efficient to vectorize. This is going to be a big deal with Cannonlake processors thanks to the AVX512-IFMA instruction set.
- Applications that target large numbers spend all their time doing large multiplications. For example, y-cruncher spends less than 0.5% of the CPU time doing bignum addition for a typical computation.
- The approach presented here has enough overhead that it needs AVX512 to outperform a chain of add-with-carry instructions on x64. This limits its usability.

So the most viable place to use vectorized addition is in legacy code that is "stuck" with full-word representations.

Another possibility for the vectorized add is the Schönhage-Strassen Algorithm (SSA) which consists largely of additions and subtractions. But the benefits of that are quite limited since the performance of SSA has fallen so far behind other (vectorizable) algorithms that a Kogge-Stone add is unlikely to save it.

**AVX512-DQ:**

Below is C++ intrinsic code for a 512-bit full-adder. The carry parameter indicates the carry-in and is replaced with the carry-out. It must be either 0 or 1.

__m512i add512(__mmask16& carry, __m512i A, __m512i B){

const __m512i MAX_WORD = _mm512_set1_epi64(0xffffffffffffffff);

__m512i s = _mm512_add_epi64(A, B);

__mmask16 c = _mm512_cmplt_epu64_mask(s, A);

__mmask16 m = _mm512_cmpeq_epi64_mask(s, MAX_WORD);

{

c <<= 1;

c = _mm512_kor(c, carry);

c += m;

carry = c >> 8;

m = _mm512_kxor(m, c);

}

return _mm512_mask_sub_epi64(s, m, s, MAX_WORD);

}

When compiled as written, all the mask operations in the middle brackets will be done entirely using AVX512 mask registers. The c += m becomes the kaddw instruction. The kadd instructions seemed weird when first introduced since it didn't make any sense to add masks. But apparently, this is a legitimate use-case.

But in practice, this doesn't happen:

- Compilers (including the Intel Compiler) rarely generate complicated sequences of AVX512 mask instructions. They prefer to move the data into general purpose registers to do the work, then move them back into mask registers.
- Mask register instructions on Skylake X are slow. They are 1-per cycle, and instructions which have cross-bit dependencies take 4 cycles.

So as of Skylake X, the sequence above is not efficient as written since mask arithmetic is slow.

If compiled as written, this sequence becomes 9 instructions with all parameters in registers or 11 instructions when streaming through memory.

The critical path carryout dependency chain is 3 (slow) mask instructions or 12 cycles on Skylake X. This sequence is latency-bound on Skylake X.

**AVX512-F:**

A small tweak to the AVX512-DQ approach will make it work on AVX512-F for Knights Landing with a couple extra instructions.

__m512i add512(uint32_t& carry, __m512i A, __m512i B){

const __m512i MAX_WORD = _mm512_set1_epi64(0xffffffffffffffff);

__m512i s = _mm512_add_epi64(A, B);

__mmask16 c = _mm512_cmplt_epu64_mask(s, A);

__mmask16 m = _mm512_cmpeq_epi64_mask(s, MAX_WORD);

{

uint32_t c0 = _mm512_mask2int(c);

uint32_t m0 = _mm512_mask2int(m);

carry += m0;

carry = (carry + c0*2); // lea

m0 ^= carry;

carry >>= 8;

m = _mm512_int2mask(m0);

}

return _mm512_mask_sub_epi64(s, m, s, MAX_WORD);

}

This version has a higher instruction count than the first example with the AVX512-DQ kaddw instruction. But since there are no expensive mask instructions, it actually runs faster on Skylake X.

If compiled as written, this sequence becomes 11 instructions with all parameters in registers or 13 instructions when streaming through memory.

The critical path carryout dependency chain is 3 GPR integer instructions or 3 cycles on Skylake X. This sequence is throughput-bound on Skylake X.

When streaming through memory, this sequence is about 2x faster than the classic approach of chaining adc instructions on Skylake X.

**AVX2:**

Just for kicks, here's an AVX2 implementation. The lack of unsigned compare and mask-to-vector instructions significantly complicates things:

const __m256i BROADCAST_MASK[16] = {

_mm256_set_epi64x(0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000),

_mm256_set_epi64x(0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000001),

_mm256_set_epi64x(0x8000000000000000, 0x8000000000000000, 0x8000000000000001, 0x8000000000000000),

_mm256_set_epi64x(0x8000000000000000, 0x8000000000000000, 0x8000000000000001, 0x8000000000000001),

_mm256_set_epi64x(0x8000000000000000, 0x8000000000000001, 0x8000000000000000, 0x8000000000000000),

_mm256_set_epi64x(0x8000000000000000, 0x8000000000000001, 0x8000000000000000, 0x8000000000000001),

_mm256_set_epi64x(0x8000000000000000, 0x8000000000000001, 0x8000000000000001, 0x8000000000000000),

_mm256_set_epi64x(0x8000000000000000, 0x8000000000000001, 0x8000000000000001, 0x8000000000000001),

_mm256_set_epi64x(0x8000000000000001, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000),

_mm256_set_epi64x(0x8000000000000001, 0x8000000000000000, 0x8000000000000000, 0x8000000000000001),

_mm256_set_epi64x(0x8000000000000001, 0x8000000000000000, 0x8000000000000001, 0x8000000000000000),

_mm256_set_epi64x(0x8000000000000001, 0x8000000000000000, 0x8000000000000001, 0x8000000000000001),

_mm256_set_epi64x(0x8000000000000001, 0x8000000000000001, 0x8000000000000000, 0x8000000000000000),

_mm256_set_epi64x(0x8000000000000001, 0x8000000000000001, 0x8000000000000000, 0x8000000000000001),

_mm256_set_epi64x(0x8000000000000001, 0x8000000000000001, 0x8000000000000001, 0x8000000000000000),

_mm256_set_epi64x(0x8000000000000001, 0x8000000000000001, 0x8000000000000001, 0x8000000000000001),

};

__m256i add256(uint32_t& carry, __m256i A, __m256i B){

A = _mm256_xor_si256(A, _mm256_set1_epi64x(0x8000000000000000));

__m256i s = _mm256_add_epi64(A, B);

__m256i cv = _mm256_cmpgt_epi64(A, s);

__m256i mv = _mm256_cmpeq_epi64(s, _mm256_set1_epi64x(0x7fffffffffffffff));

uint32_t c = _mm256_movemask_pd(_mm256_castsi256_pd(cv));

uint32_t m = _mm256_movemask_pd(_mm256_castsi256_pd(mv));

{

c = m + 2*c; // lea

carry += c;

m ^= carry;

carry >>= 4;

m &= 0x0f;

}

return _mm256_add_epi64(s, BROADCAST_MASK[m]);

}

This monstrosity compiles to 13 instructions (14 when streaming through memory). The initial XOR and the tweaking of the constants are work-arounds for the lack of an unsigned integer compare. The lookup table is a work-around for the lack of a mask-to-vector instruction.

Surprisingly, benchmarks show this to only be about 10% slower than a chain of 4 adc instructions when streaming through memory.

For the sake of completeness, here are the full-subtractors. They are the same except for a few sign-flips and tweaks of the constants.

Again the carry-in must be either 0 or 1. 1 indicates a borrow in.

**AVX512-DQ:**

__m512i sub512(__mmask16& carry, __m512i A, __m512i B){

const __m512i MAX_WORD = _mm512_set1_epi64(0xffffffffffffffff);

__m512i s = _mm512_sub_epi64(A, B);

__mmask16 c = _mm512_cmpgt_epu64_mask(s, A);

__mmask16 m = _mm512_cmpeq_epi64_mask(s, _mm512_setzero_si512());

{

c <<= 1;

c = _mm512_kor(c, carry);

c += m;

carry = c >> 8;

m = _mm512_kxor(m, c);

}

return _mm512_mask_add_epi64(s, m, s, MAX_WORD);

}

**AVX512-F:**

__m512i sub512(uint32_t& carry, __m512i A, __m512i B){

const __m512i MAX_WORD = _mm512_set1_epi64(0xffffffffffffffff);

__m512i s = _mm512_sub_epi64(A, B);

__mmask16 c = _mm512_cmpgt_epu64_mask(s, A);

__mmask16 m = _mm512_cmpeq_epi64_mask(s, _mm512_setzero_si512());

{

uint32_t c0 = _mm512_mask2int(c);

uint32_t m0 = _mm512_mask2int(m);

carry += m0;

carry = (carry + c0*2); // lea

m0 ^= carry;

carry >>= 8;

m = _mm512_int2mask(m0);

}

return _mm512_mask_add_epi64(s, m, s, MAX_WORD);

}

**AVX2:**

Using the same lookup table as the full-adder...

__m256i sub256(uint32_t& carry, __m256i A, __m256i B){

A = _mm256_xor_si256(A, _mm256_set1_epi64x(0x8000000000000000));

__m256i s = _mm256_sub_epi64(A, B);

__m256i cv = _mm256_cmpgt_epi64(s, A);

__m256i mv = _mm256_cmpeq_epi64(s, _mm256_set1_epi64x(0x8000000000000000));

uint32_t c = _mm256_movemask_pd(_mm256_castsi256_pd(cv));

uint32_t m = _mm256_movemask_pd(_mm256_castsi256_pd(mv));

{

c = m + 2*c; // lea

carry += c;

m ^= carry;

carry >>= 4;

m &= 0x0f;

}

return _mm256_sub_epi64(s, BROADCAST_MASK[m]);

}