# Proposal: augmentedSum and related operations

## Table of Contents

- 1. Description
- 2. Issues
- 3. Recently resolved issues
- 4. Rationale and background
- 5. twoProduct / augmentedMultiplication
- 6. Proposed text (Recommended Operations)
- 7. Acknowledgments
- 8. Appendix: C examples

Last update: 2016-10-12 Wed 13:33

## 1 Description

Add an operation, with proposed name augmentedAddition and historical name twoSum, that produces a pair of source-format results accurately representing x+y, to support both extra precision and reproducible summation. Additionally, add augmentedSubtraction (x-y → head, tail) and augmentedMultiplication (x*y → head, tail) to support accelerating double-double and similar arithmetics.

## 3 Recently resolved issues

- Add a sentence explaining why we recommend only binary format operations.
  **Resolution**: Added "Note that this standard recommends only the operations for binary formats because the requirements to address augmenting decimal format arithmetic are not yet determined."

- Replace "two" with "augmented" in operation names.
**Resolution**: Done.

- Minor naming nit: Should the operations' names be derived from the existing operation names? The operations in Clause 5.4.1 are addition, subtraction, and multiplication.
**Resolution**: Change to "augmentedAddition", etc. (23 September 2016 meeting)

- In decimal, how should the preferred exponent be applied? twoSum as a whole is exact unless overflow occurs, so the preferred exponent *will* apply. The preferred exponent (and possible quantum exception) ease implementing fixed-point arithmetic, so those needs should guide the specification. (Being discussed in email.)
  **Resolution**: Suggest only for binary unless there is a clearly correct semantics in decimal.

- Some naming issues. What should be the Clause's title? And should the operations be named twoAddition, etc. to match Clause 5.4.1 or remain twoSum, etc. to match common usage?
- Suggested titles: "Tail Operations", "Operations with Remainder", "Residue Operations", "Compensating Operations". All risk confusion with other concepts.
  **Resolution**: Be clear rather than terse. "Operations to Support Extending Arithmetic Precision"
  **Re-Resolution**: Change "extending" to "augmenting". (23 September 2016 meeting)

- Should any of this be defined for decimal? *Should the operations be limited to binary?* How much of this works correctly under decimal? Are there accompanying proofs?
  We refer to Priest^{1} for algorithms that implement double-double decimal arithmetic using the proposed new twoSum operation. It is an open question whether the reproducible summation algorithm in Ahrens, Demmel, and Nguyen can be extended to work in decimal arithmetic. (Volunteers welcome. Some binary assumptions, e.g. leading and last *bits*, need thought.)
  **Resolution**: Yes, it *should* be defined *if* we can decide on semantics for the preferred exponent.
  **Update**: Reduced to a recommendation for binary operations *unless* a clearly correct semantics for decimal appears.

- The operation definition did not nail down how many **new** NaNs are produced. Or, for a multi-instruction implementation, **when** they are produced. *Recent suggestions are to produce a single NaN.*
  **Resolution**: Producing the same NaN for both outputs on an invalid operation was the sense at the 14 July 2016 meeting.

- Inexact on overflow: Overly redundant or necessary because the result *is* inexact?
  **Resolution**: Signal inexact. (14 July 2016 meeting and email)

- Is specifying exceptional cases for twoProduct a sufficient reason to include it given it is inexact around underflow?
  **Resolution**: Yes, as it appears to cause no harm and provides important guidance. (14 July 2016 meeting)

## 4 Rationale and background

Non-standard extended precisions like double-double or quad-double optimize multiplication by using fusedMultiplyAdd to provide a two-operation sequence to convert a pair of floating-point inputs into both the rounded product of those inputs and the exactly representable difference between the exact and rounded products. There is no such efficient operation to assist addition. One fully general implementation for addition requires six operations; see the twoSum code below for one implementation. The ReproBLAS algorithm for reproducible summation uses a chain of operations similar to the quickTwoSum code below in its inner loop.

The performance of addition becomes a bottleneck for both extended precision and reproducible arithmetic. We propose adding the operation twoSum to encourage optimized implementations. The implementation may consist of a single operation or a sequence of operations that produces the two results.

We see three options for defining this operation:

- Fully specify the operation according to mathematical justifications for exceptional behavior.
- Fully specify the operation to match some specific software implementation's behavior.
- Fully specify the operation only on finite, non-zero results to permit accelerating different current software implementations.

This background outlines what is needed for the first two options. The proposed text in Section 6 takes the first option and fully specifies the operation according to mathematical justifications.

Many practical uses of twoSum for extending precision would be far less interesting if hardware simply implemented binary128 arithmetic no more than 2× to 4× slower than binary64. The reproducible arithmetic algorithms referenced, however, still benefit from an efficiently implemented operation following the proposed twoSum specification.

Some issues that remain in this proposal are described in Section 2 above. Decimal support provides the main sticking point, which this proposal side-steps. The proposed operations are named augmentedAddition, etc. rather than twoSum to stand apart from existing implementations.

### 4.1 Properties of an existing software implementation

Here we outline one C implementation of the twoSum operation to provide a backdrop for describing desired exceptional behavior. We also provide a variation, quickTwoSum, that makes additional assumptions on its input.

The C implementation follows an algorithm of Knuth's in the second volume of TAoCP and makes no explicit assumptions on its input:

```c
void twoSum (const double x, const double y,
             double * out_head, double * out_tail)
{
  const double s = x + y;
  const double bb = s - x;
  const double err = (x - (s - bb)) + (y - bb);
  *out_head = s;
  *out_tail = err;
}
```

For cases where twoSum(x, y) produces finite numbers as results, there is no error, but this implementation does signal inexact. That exception could be masked using optional C99 features at some performance cost. Table 1 summarizes the above twoSum's behavior except for one case of intermediate overflow discussed after quickTwoSum below.

Table 1: Behavior of the twoSum implementation above.

x | y | head | tail | Signal | Note
---|---|---|---|---|---
NaN_{1} | NaN_{2} | NaN_{1} or NaN_{2} | NaN_{1} or NaN_{2} | on sNaN |
∞ | NaN_{2} | NaN_{2} | NaN_{2} | on sNaN |
-∞ | NaN_{2} | NaN_{2} | NaN_{2} | on sNaN |
NaN_{1} | ∞ | NaN_{1} | NaN_{1} | on sNaN |
NaN_{1} | -∞ | NaN_{1} | NaN_{1} | on sNaN |
∞ | ∞ | ∞ | NaN (new) | invalid |
-∞ | ∞ | NaN (new) | NaN (new) | invalid |
∞ | -∞ | NaN (new) | NaN (new) | invalid |
-∞ | -∞ | -∞ | NaN (new) | invalid |
X | Y | ∞ | NaN (new) | invalid | X+Y rounds to ∞
-X | -Y | -∞ | NaN (new) | invalid | -X+-Y rounds to -∞
X | Y | round(X+Y) | X+Y-round(X+Y) | inexact | round(Z) rounds the expression Z
+0 | +0 | +0 | +0 | (none) |
+0 | -0 | +0 | +0 | (none) |
-0 | +0 | +0 | +0 | (none) |
-0 | -0 | -0 | +0 | (none) |

When the exponent of x is known to be at least that of y, which is satisfied when |x| ≥ |y|, the quickTwoSum routine below produces an exact transformation into a head and a tail. This routine is the workhorse in many codes where the input magnitudes are known (e.g. the bins in the ReproBLAS). The two example implementations share the output behavior of Table 1 except for a few exceptional cases. On overflow, the quickTwoSum version below produces an Inf tail with the opposite sign and does not signal invalid. Also, quickTwoSum(-0, -0) produces two -0 outputs while twoSum(-0, -0) produces (-0, +0). quickTwoSum does not suffer from overflow after the first addition.

```c
void quickTwoSum (const double x, const double y,
                  double * out_head, double * out_tail)
{
  const double s = x + y;
  const double bb = s - x;
  const double err = y - bb;
  *out_head = s;
  *out_tail = err;
}
```

An additional intermediate overflow does occur in the above twoSum implementation. As related to the case in Boldo, Graillat, and Muller^{2}, twoSum(x, y) overflows in computing `bb` when y is the largest representable finite number, Ω = (2 - 2^{(1-p)}) 2^{Emax}, and x is -(3/2) ulp(Ω). In that and related cases where `bb` overflows, twoSum produces a NaN tail and a spurious overflow signal even though the rounded sum is finite. Amusingly, fixing this may require a conditional within twoSum, so twoSum may as well check magnitudes and call quickTwoSum.

To provide further context, implementations using twoSum and quickTwoSum to implement double-double addition and vectorized extended-precision summation are provided in an appendix. We also provide versions that implement our preferred mathematical properties to show the costs imposed on software implementations.

### 4.2 Desired mathematical properties for rounding and exceptional behavior

The appended implementations of twoSum and quickTwoSum disagree on exceptional behavior and the sign of zero. Other implementations of equivalent operations may have different disagreements, for example an implementation of twoSum that compares argument magnitudes and then calls quickTwoSum. That will agree with quickTwoSum's behavior except on which NaN is propagated and possibly zero signs. Replacing `bb = s - x` with `bb = x - s` and `err = y - bb` with `err = bb - y` in quickTwoSum above makes quickTwoSum(-0, -0) the same as twoSum(-0, -0), but now the cases that mix +0 and -0 differ.

Traditionally, IEEE-754 has chosen to reduce differences between different platforms. We prefer fully specifying the operation over permitting platform-specific behavior, and we do not want to choose one implementation without additional justification.

We want to support both extended arithmetic and reproducible arithmetic. Extended arithmetics like double-double and quad-double do not specify exceptional behavior because that behavior depends on implementation minutiae. Fully specifying twoSum is a step towards improving these arithmetics' cross-platform behavior.

Reproducible summation across different orderings is a more recent concern, brought to the fore by massive parallelism even within laptops and devices. The algorithms are not as well established. We look to one example for guidance: the reproducible BLAS implementation in the ReproBLAS. The algorithms for reproducible sum and dot product scale to avoid depending on the overflow behavior of quickTwoSum. Those algorithms do depend on gradual underflow; flush-to-zero complicates and slows the algorithms. The ReproBLAS explicitly does not differentiate between NaNs of different payloads or between zeros of different signs, but future reproducible BLAS implementations may be able to build on a fully specified twoSum to extend their results. Overall, fully specifying twoSum helps with reproducibility. Networked systems have demonstrated the difficulties of maintaining software in the face of mixed arithmetic semantics.

#### 4.2.1 Rounding attribute

Operations like twoSum and the previously proposed additionTail only provide an exact transformation with the existing rounding attributes of roundTiesToEven and roundTiesToAway. Unidirectional rounding attributes like roundTowardPositive may require more bits than two *sourceFormat* results can provide. Consider twoSum(-2^{100}, 2^{-200}) with roundTowardPositive in binary64 (binary double precision). The second result of twoSum, the tail, would have to represent -2^{47} + 2^{-200}, which requires more significand bits than binary64 provides. So the twoSum operation cannot both support all rounding attributes and remain an exact transformation for all finite results.

As seen in Ahrens, Demmel, and Nguyen's draft "Exploring ways to support reproducible summation in a future IEEE 754 Standard" ^{3}, the twoSum operation with roundTiesToAway supports efficient reproducible summation and dot product. This leads to fast reproducible matrix multiplication and beyond; see the ReproBLAS implementation and analysis. Reproducibility of summation and dot product benefits users of platforms with varying numbers of floating-point execution units, including exascale distributed systems as well as different product models of CPUs or GPUs.

This proposal specifies roundTiesToAway for all twoSum operations. A single, static rounding attribute removes the edge cases related to directed rounding. Both double-double and reproducible summation function under roundTiesToAway, so we choose this attribute.

A recent report^{2} proves that unless overflow occurs, an *inexact* tail has error only in the last digits. Permitting all rounding attributes with that reference's caveat may be feasible. We prefer to specify a single, fixed rounding attribute with known uses. As always, implementations can provide non-standard operations with additional features should applications appear.

#### 4.2.2 Signaling inexact

Software implementations often will raise inexact even though the operation as a whole *is* a provably exact transformation (in the above rounding attributes). Implementations with additional hardware assistance can avoid raising inexact. While the utility of inexact may be questioned, we should at least propose a single behavior, and we choose to raise inexact only on overflow for twoSum. Inexact is redundant with overflow in this case and could be removed.

If implemented as a sequence of two operations, the first (assumed to be an addition) must have its rounding attribute fixed to roundTiesToAway and could suppress inexact in this fixed two-operation usage.

#### 4.2.3 Overflow behavior

In general, doubled arithmetics (double-double, double-single, etc.) break down on overflow. The ad-hoc behaviors of twoSum and quickTwoSum above differ. twoSum(x, y) delivers (Inf, NaN) when x+y overflows, while quickTwoSum on the same arguments delivers (Inf, -Inf). Neither is entirely satisfactory. Continued accumulation into a double-double type based on either routine turns overflow into NaNs, which also is not satisfactory. Below we define twoSum to produce (Inf, Inf) or (-Inf, -Inf) on overflow and when inputs are infinities with the same sign. Then doubled arithmetic emulates standard arithmetic behavior on overflow. We include example source code demonstrating this behavior at the end.

Other possibilities mentioned in email can expand the range slightly at some cost in analysis complexity. We prefer behavior that is simpler to explain.

#### 4.2.4 Underflow behavior

Ignoring overflow, specifying default exceptional behavior (gradual underflow) guarantees that roundTiesToAway(x+y) can be expressed in *sourceFormat*. The distance from zero to roundTiesToAway(x+y) also can be expressed in *sourceFormat*, so twoSum can provide appropriate results. On underflow with gradual underflow, x+y is **exact**, so the options for detecting tiny results in binary both signal underflow. In this case, twoSum returns the exact underflowed sum x+y as the first result, zero for the second result, and signals underflow.

The alternate exception handling that flushes underflow results (or inputs) to zero cannot even return the results. If the head is flushed to zero, then the error would be the quantity just flushed to zero and hence also be flushed to zero. Everything breaks, and users don't even get to keep the pieces.

#### 4.2.5 Signed zero behavior

Unfortunately with the above implementation twoSum(-0.0, -0.0) returns -0.0 and 0.0. This result is difficult to specify in a standards text. That twoSum and quickTwoSum differ on the sign for the second result is annoying, although altering the signs in the implementations can generate other behaviors on zeros while preserving the overall behavior. Instead of leaving the sign unspecified, we prefer a single, simply specified result and choose twoSum on two zero arguments to produce the same sign in both results as addition on those zero arguments.

#### 4.2.6 Summary

The intended proposed behavior is summarized in Table 2.

Table 2: Proposed behavior.

x | y | head | tail | Signal | Note
---|---|---|---|---|---
NaN_{1} | NaN_{2} | NaN_{1} or NaN_{2} | NaN_{1} or NaN_{2} | on sNaN |
∞ | NaN_{2} | NaN_{2} | NaN_{2} | on sNaN |
-∞ | NaN_{2} | NaN_{2} | NaN_{2} | on sNaN |
NaN_{1} | ∞ | NaN_{1} | NaN_{1} | on sNaN |
NaN_{1} | -∞ | NaN_{1} | NaN_{1} | on sNaN |
∞ | ∞ | ∞ | ∞ | (none) |
-∞ | ∞ | NaN (new) | NaN (new) | invalid |
∞ | -∞ | NaN (new) | NaN (new) | invalid |
-∞ | -∞ | -∞ | -∞ | (none) |
X | Y | ∞ | ∞ | overflow, inexact | X+Y rounds to ∞
-X | -Y | -∞ | -∞ | overflow, inexact | -X+-Y rounds to -∞
X | Y | rta(X+Y) | X+Y-rta(X+Y) | (none) | rta(Z) rounds the expression Z under roundTiesToAway
+0 | +0 | +0 | +0 | (none) |
+0 | -0 | +0 | +0 | (none) |
-0 | +0 | +0 | +0 | (none) |
-0 | -0 | -0 | -0 | (none) |

#### 4.2.7 Preferred exponent

If decimal is supported, which remains an open question, the operation must specify a preferred exponent to render operation results unambiguous. The preferred exponent of addition or subtraction of x and y is min(Q(x), Q(y)) as defined in the standard. Using the notation from the table above, the preferred exponent of x + y - rta(x + y) would be min(Q(x), Q(y), min(Q(x), Q(y))) = min(Q(x), Q(y)). So both the head and tail have the same preferred exponent, min(Q(x), Q(y)).

Michel Hack, for one, points out that if the operations are used to emulate a wider precision in fixed-point arithmetic, the head+tail should be considered as a single number. Preserving the (decimal) quantum on the combined number may require different rules.

### 4.3 Potential performance impacts of hardware support

Current software implementations of double-double vector sum reductions and dot products require a long, sequential string of operations to collect the results within register-blocked kernels. That strongly penalizes the performance of double-double arithmetic. A 2016 study^{4} compared the performance of routines based on the above twoSum with an emulated two-operation routine that produces incorrect numerical results but good timing estimates. On most architectures, they replace the tail calculation with `min(a, b)` to quantify by how much performance could improve. Double-double matrix multiplication, the workhorse for matrix factorizations, improves by 1.28× on Knights Landing and by nearly 2× on Haswell and Skylake. Dot product using double-double internally shows improvements ranging from 1.3× to 2×. A similar experiment with reproducible summation, useful for termination testing in iterative methods, shows performance improvements of up to 1.57×^{3}. Efficient reproducibility benefits not only massive distributed systems but also product lines of accelerators / GPUs that contain varying numbers of execution units. There is the possibility that fast-but-unstable numerical algorithms can use doubled arithmetic to gain speed while staying within their region of stability^{5}, but these can be difficult to measure through analogous experiments when they include post-processing that relies on the numerical values.

The error from multiplication can be returned by fusedMultiplyAdd. A similar twoProduct operation could provide more efficient operation by computing both the product and error in parallel through one hardware operation or two non-dependent hardware operations. However, often performance in double-double dot products or matrix multiplication is more limited by a pile-up of additions than by waiting on the multiplication. Double-double polynomial evaluation as occurs in some elementary function implementations may be more strongly affected, see the reference ^{4}, but we have not tested this idea specifically.

## 5 twoProduct / augmentedMultiplication

Many justifications for twoSum's exceptional behavior come from double-double arithmetic. twoSum is only one of the two primary operations involved. A twoProduct that produces a head and tail from multiplication is the other. A typical twoProduct implementation based on fusedMultiplyAdd falls apart on overflow, producing an infinite head and an unusable tail. In a dot product, accumulating that pair produces a (NaN, NaN) result. Currently, the ReproBLAS reproducible dot product does not evaluate the product using double-double, but a fast and fully-defined twoProduct could permit better accuracy along with reproducibility.

Note that the historical name twoProduct is replaced by augmentedMultiplication in the proposed text below.

Specifying twoProduct does not lead to much of a slippery slope. Division and square root are not so simple; there can be no "error free" output pair in the result space. Their software implementations do not have obvious places to accelerate outside of the uses of twoSum and twoProduct. Conditionals to handle exceptional cases may not impact performance much, although there is no data. And unlike complex arithmetic, existing languages do not specify incompatible behavior.

One difference is that twoProduct has an additional inexact case. twoProduct is not an exact transformation if some piece of the (head, tail) representation lies below the underflow threshold. This can occur either when the head underflows entirely or when some trailing bits lie below b^{(emin-p+1)} in magnitude. The error then cannot be represented in the *sourceFormat*. The fusedMultiplyAdd implementation in Section 8 also returns a zero tail with the opposite sign as the head if the head underflows but is not zero.

The new clause is only a recommendation but provides guidance towards desired behavior. The example code in Section 8 includes a careful implementation of this proposed operation along with examples comparing underflow behavior.

So far, the only known use of twoProduct is implementing extended precisions like double-double. The ReproBLAS scales to avoid overflow and ignores underflow. Performance of polynomial evaluation may benefit from an operation like twoProduct that removes the data dependency between `p = a*b` and `t = fma (a, b, -p)` but likely does not benefit from the careful exceptional case definitions. Precisions like double-double, quad-double, etc. benefit from overflow semantics similar to typical IEEE-754 arithmetic.

## 6 Proposed text (Recommended Operations)

### 6.1 Clause 5.1 (Overview)

Change the text

Each of the computational operations that return a numeric result specified by this standard shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result, if necessary, to fit in the destination’s format (see 4 and 7).

to permit the necessary "intermediate rounding" in augmentedAddition.

Proposed new text:

Unless otherwise specified, each of the computational operations that returns numeric results shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result, if necessary, to fit in the destination’s format (see 4 and 7).

### 6.2 Clause 9.1 (Conforming language- and implementation-defined operations)

Replace "as many of the operations of 9.2" with "as many of the operations of Clauses 9.2 and 9.5."

### 6.3 New Clause 9.5 (Operations to support augmenting arithmetic precision) under 9 (Recommended operations)

Create a new clause under Clause 9 with the following text:

Language standards should define for binary formats, to be implemented according to 9.1, this clause's operations as is appropriate to the language. As with other operations of this standard, the names of the operations specified in this clause do not necessarily correspond to the name or names that any particular programming language would use. These homogeneous operations produce a pair of *sourceFormat* results denoted (*sourceFormat*, *sourceFormat*). Note that this standard recommends only the operations for binary formats because the requirements to address augmenting decimal format arithmetic are not yet determined.

(*sourceFormat*, *sourceFormat*) augmentedAddition(*source*, *source*)

The operation augmentedAddition(x, y) computes both the exact sum x+y rounded to *sourceFormat* using the roundTiesToAway rounding attribute and the error in rounding the sum when the rounded x+y is finite. This operation is specified only under default exception handling (Clause 7). If roundTiesToAway(x+y) denotes the infinitely precise result of x+y rounded to *sourceFormat* using the roundTiesToAway rounding attribute, then augmentedAddition produces roundTiesToAway(x+y) and x+y-roundTiesToAway(x+y) when roundTiesToAway(x+y) is a finite number. The operation propagates a NaN as both results if any input is NaN (Clause 6.2.3). If x+y is zero, both produced results are the result of x+y (and have the same sign). If x+y is infinite, both produced results are the result of x+y, and the operation signals like x+y. If x+y is invalid (Clause 7.2), the operation is invalid and produces the same quiet NaN for both outputs. The operation signals inexact only when roundTiesToAway(x+y) overflows.

(*sourceFormat*, *sourceFormat*) augmentedSubtraction(*source*, *source*)

The operation augmentedSubtraction(x, y) computes both the exact difference x-y rounded to *sourceFormat* using the roundTiesToAway rounding attribute and the error in rounding the difference when the rounded x-y is finite. This operation is specified only under default exception handling (Clause 7). If roundTiesToAway(x-y) denotes the infinitely precise result of x-y rounded to *sourceFormat* using the roundTiesToAway rounding attribute, then augmentedSubtraction produces roundTiesToAway(x-y) and x-y-roundTiesToAway(x-y) when roundTiesToAway(x-y) is a finite number. The operation propagates a NaN as both results if any input is NaN (Clause 6.2.3). If x-y is zero, both produced results are the result of x-y (and have the same sign). If x-y is infinite, both produced results are the result of x-y, and the operation signals like x-y. If x-y is invalid (Clause 7.2), the operation is invalid and produces the same quiet NaN for both outputs. The operation signals inexact only when roundTiesToAway(x-y) overflows.

(*sourceFormat*, *sourceFormat*) augmentedMultiplication(*source*, *source*)

The operation augmentedMultiplication(x, y) computes both the exact product x*y rounded to *sourceFormat* using the roundTiesToAway rounding attribute and the error in rounding the product when the rounded x*y is finite and does not underflow. This operation is specified only under default exception handling (Clause 7). If roundTiesToAway(x*y) denotes the infinitely precise result of x*y rounded to *sourceFormat* using the roundTiesToAway rounding attribute, then augmentedMultiplication produces roundTiesToAway(x*y) and x*y-roundTiesToAway(x*y) when x*y-roundTiesToAway(x*y) can be represented exactly as a finite number in *sourceFormat*. The operation propagates a NaN as both results if any input is NaN (Clause 6.2.3). If x*y is zero, both produced results are the result of x*y and have the same sign. If x*y is infinite, both produced results are the result of x*y, and the operation signals as x*y. If x*y is invalid (Clause 7.2), the operation is invalid and produces the same quiet NaN for both outputs. If x*y-roundTiesToAway(x*y) is non-zero and non-infinite but cannot be represented exactly in *sourceFormat* (because some non-zero digits lie strictly between ±b^{(emin-p+1)}), the results are roundTiesToAway(x*y) and roundTiesToAway(x*y-roundTiesToAway(x*y)). This case always signals inexact and signals underflow under the conditions of Clause 7.5.

Much of the first paragraph is copied from an earlier clause. Because this is not a single mathematical function and also returns a *pair* of results, augmentedAddition does not fit well in Table 9.1. A separate section reduces clause-number change and eases adding other operations.

This clause could be placed earlier; this proposal avoids re-numbering references by appending to the existing clauses.

### 6.4 Annex A (Bibliography)

Add the following references for uses of augmentedAddition (after normalizing to some bibliographic standard):

- Ahrens, P, Nguyen, H. D., and Demmel, J. "Efficient Reproducible Floating Point Summation and BLAS," EECS Department, UC Berkeley, Technical Report No. UCB/EECS-2016-121, June 18, 2016. http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-121.html
- Priest, D. M. Algorithms for arbitrary precision floating point arithmetic. In the Proceedings of the 10th IEEE Symposium on Computer Arithmetic, 1991. (pp. 132-143). doi:10.1109/ARITH.1991.145549

Potential references for additional background and uses:

- Dekker, T. J. "A Floating-Point Technique for Extending the Available Precision," Numer. Math., 18(3), 1971, pp. 224-242. doi:10.1007/BF01397083
- Hida, Y., Li, X. S., and Bailey, D. H. "Algorithms for quad-double precision floating point arithmetic," *15th IEEE Symposium on Computer Arithmetic*, IEEE Computer Society, 2001, pp. 155-162. doi:10.1109/ARITH.2001.930115
- Rump, S. M., Ogita, T., and Oishi, S. "Fast high precision summation," *Nonlinear Theory and Its Applications (NOLTA)*, IEICE, 1(1), 2010. doi:10.1587/nolta.1.2
- Shewchuk, J. R. "Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates," *Discrete & Computational Geometry*, 18:305-363, 1997. doi:10.1007/PL00009321

If we also are adding performance rationale, add the following:

- Dukhan, M., Vuduc, R., and Riedy, J. "Wanted: Floating-Point Add Round-off Error instruction," *Proceedings of the 2nd International Workshop on Performance Modeling: Methods and Applications (PMMA16)* at ISC High Performance, July 2016. Also available at ArXiv e-prints: http://arxiv.org/abs/1603.00491
- Yamazaki, I., Tomov, S., and Dongarra, J. "Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs," SIAM J. Sci. Comput., 37(3): C307-C330, 2015. doi:10.1137/14M0973773

Technical reports may be replaced by publications before the IEEE-754 draft containing augmentedAddition comes to a vote.

## 7 Acknowledgments

James Demmel (UCB), Xiaoye Li (LBNL), Greg Henry (Intel), and Peter Tang (Intel) have provided enough detailed feedback that they should be considered proposal co-authors.

Peter Ahrens (UCB->MIT) and Hong Diep Nguyen (UCB) have answered many questions about the ReproBLAS and interactions with twoSum.

Michel Hack (IBM) and others pushed back on the single-implementation specification looking for proper justification on exceptional behavior.

Marat Dukhan (GT) quantified many *additionError* primitive performance benefits, spurring the twoSum revival.

## 8 Appendix: C examples

Refer to qd at http://crd-legacy.lbl.gov/~dhbailey/mpdist/ for a debugged, tested implementation of double-double and quad-double arithmetic. The included source is not an example of good style and may include bugs but should give an idea of how twoSum (and potentially twoProduct) appear in actual code.

First, collect the primitives above into one header, here `dd-primitives.h`.

```c
static void
twoSum (const double x, const double y, double * out_head, double * out_tail)
{
  const double s = x + y;
  const double bb = s - x;
  const double err = (x - (s - bb)) + (y - bb);
  *out_head = s;
  *out_tail = err;
}

static void
quickTwoSum (const double x, const double y, double * out_head, double * out_tail)
{
  const double s = x + y;
  const double bb = s - x;
  const double err = y - bb;
  *out_head = s;
  *out_tail = err;
}

static void
twoDifference (const double x, const double y, double * out_head, double * out_tail)
{
  const double s = x - y;
  const double bb = s - x;
  const double err = (x - (s - bb)) - (y + bb);
  *out_head = s;
  *out_tail = err;
}

static void
quickTwoDifference (const double x, const double y, double * out_head, double * out_tail)
{
  const double s = x - y;
  const double bb = x - s;
  const double err = bb - y;
  *out_head = s;
  *out_tail = err;
}

static void
twoProduct (double a, double b, double *out_head, double *out_tail)
{
  double p = a * b;
  double t = fma (a, b, -p);
  *out_head = p;
  *out_tail = t;
}
```

A more careful and slower implementation that handles exceptional cases follows in `dd-careful-primitives.h`:

```c
static void
inner_quickTwoSum (const double x, const double y, const double s,
                   double * out_head, double * out_tail)
{
#pragma STDC FENV_ACCESS on
  const double bb = s - x;
  const double err = y - bb;
  *out_head = s;
  *out_tail = err;
}

static void
quickTwoSum (const double x, const double y, double * out_head, double * out_tail)
{
#pragma STDC FENV_ACCESS on
  fenv_t env;
  feholdexcept (&env);
  const double s = x + y;
  if (isinf (s) || s == 0.0 || isnan (s)) {
    *out_head = *out_tail = s;
    feupdateenv (&env);
    return;
  }
  inner_quickTwoSum (x, y, s, out_head, out_tail);
  fesetenv (&env);
}

static void
twoSum (const double x, const double y, double * out_head, double * out_tail)
{
#pragma STDC FENV_ACCESS on
  fenv_t env;
  feholdexcept (&env);
  const double s = x + y;
  if (isinf (s) || s == 0.0 || isnan (s)) {
    *out_head = *out_tail = s;
    feupdateenv (&env);
    return;
  }
  if (fabs (x) < fabs (y))
    inner_quickTwoSum (y, x, s, out_head, out_tail);
  else
    inner_quickTwoSum (x, y, s, out_head, out_tail);
  fesetenv (&env);
}

static void
quickTwoDifference (const double x, const double y, double * out_head, double * out_tail)
{
  quickTwoSum (x, -y, out_head, out_tail);
}

static void
twoDifference (const double x, const double y, double * out_head, double * out_tail)
{
  twoSum (x, -y, out_head, out_tail);
}

static void
twoProduct (double a, double b, double *out_head, double *out_tail)
{
#pragma STDC FENV_ACCESS on
  fenv_t env;
  int fe_flags;
  feholdexcept (&env);
  double p = a * b;
  if (isinf (p) || p == 0.0 || isnan (p)) {
    *out_head = *out_tail = p;
    feupdateenv (&env);
    return;
  }
  double t = fma (a, b, -p);
  fe_flags = fetestexcept (FE_INEXACT | FE_UNDERFLOW);
  if (t == 0.0) t = copysign (0.0, p);
  *out_head = p;
  *out_tail = t;
  fesetenv (&env);
  feraiseexcept (fe_flags);
}
```

An example of double-double arithmetic operations as well as wide summation and dot product is in `dd-example.c`:

```c
#include <stdlib.h>
#include <stdio.h>
#include <float.h>
#include <math.h>
#include <fenv.h>

#if !defined(CAREFUL_PRIMITIVES)
#include "dd-primitives.h"
#else
#include "dd-careful-primitives.h"
#endif

void
dd_add (double a_head, double a_tail, double b_head, double b_tail,
        double *out_head, double *out_tail)
{
  /* This satisfies an IEEE754-style error bound, due to K. Briggs and
     W. Kahan. */
  double hh, ht, th, tt;
  twoSum (a_head, b_head, &hh, &ht);
  twoSum (a_tail, b_tail, &th, &tt);
  ht += th;
  quickTwoSum (hh, ht, &hh, &ht);
  ht += tt;
  quickTwoSum (hh, ht, out_head, out_tail);
}

void
dd_add_d (double a_head, double a_tail, double b,
          double *out_head, double *out_tail)
{
  double h, t;
  twoSum (a_head, b, &h, &t);
  t += a_tail;
  quickTwoSum (h, t, out_head, out_tail);
}

void
dd_mult (double a_head, double a_tail, double b_head, double b_tail,
         double *out_head, double *out_tail)
{
  double h, t;
  twoProduct (a_head, b_head, &h, &t);
  t += a_head * b_tail;
  t += a_tail * b_head;
  quickTwoSum (h, t, out_head, out_tail);
}

void
dd_mult_d (double a_head, double a_tail, double b,
           double *out_head, double *out_tail)
{
  double h, t;
  twoProduct (a_head, b, &h, &t);
  t += a_tail * b;
  quickTwoSum (h, t, out_head, out_tail);
}

static void
scalar_widesum (size_t N, const double *a, double *s_out, double *r_out)
{
  /* Compensated summation, Sum2s in Ogita, Rump, & Oishi. */
  double s, r;
  if (N == 0) { *s_out = 0.0; *r_out = 0.0; return; }
  if (N == 1) { *s_out = *a; *r_out = 0.0; return; }
  s = a[0];
  r = 0.0;
  for (size_t k = 1; k < N; ++k) {
    double t;
    twoSum (s, a[k], &s, &t);
    r += t;
  }
  *s_out = s;
  *r_out = r;
}

static void
scalar_widedot (size_t N, const double *a, const double *b,
                double *s_out, double *r_out)
{
  double s, r;
  if (N == 0) { *s_out = 0.0; *r_out = 0.0; return; }
  if (N == 1) { twoProduct (*a, *b, s_out, r_out); return; }
  twoProduct (*a, *b, &s, &r);
  for (size_t k = 1; k < N; ++k) {
    double p, t, tt;
    twoProduct (a[k], b[k], &p, &t);
    twoSum (p, s, &s, &tt);
    r += tt;
    twoSum (t, s, &s, &tt);
    r += tt;
  }
  *s_out = s;
  *r_out = r;
}

#if !defined(NV)
#define NV 4
#endif

double
widesum (size_t N, double *a_in)
{
  const double * restrict a = a_in;
  double s[NV], r[NV];
  size_t outer_k;
  if (N < 2*NV) { /* More likely to be 4*NV or 8*NV */
    scalar_widesum (N, a, s, r);
    return s[0] + r[0];
  }
#pragma omp simd
  for (size_t k = 0; k < NV; ++k) {
    s[k] = a[k];
    r[k] = 0.0;
  }
  for (outer_k = NV; outer_k + NV <= N; outer_k += NV) {
    const double * restrict av = &a[outer_k];
#pragma omp simd
    for (size_t k = 0; k < NV; ++k) {
      double t;
      twoSum (s[k], av[k], &s[k], &t);
      r[k] += t;
    }
  }
  if (outer_k != N) {
    /* Ragged remainder when N is not a multiple of NV. */
    const double * restrict av = &a[outer_k];
    for (size_t k = 0; k < N - outer_k; ++k) {
      double t;
      twoSum (s[k], av[k], &s[k], &t);
      r[k] += t;
    }
  }
  for (size_t k = 1; k < NV; ++k) {
    double t;
    twoSum (s[0], s[k], &s[0], &t);
    r[0] += t;
    r[0] += r[k];
  }
  return s[0] + r[0];
}

double
widedot (size_t N, double *a_in, double *b_in,
         double *head_out, double *tail_out)
{
  const double * restrict a = a_in;
  const double * restrict b = b_in;
  double s[NV], r[NV];
  size_t outer_k;
  if (N < 2*NV) { /* More likely to be 4*NV or 8*NV */
    scalar_widedot (N, a, b, s, r);
    goto out;
  }
#pragma omp simd
  for (size_t k = 0; k < NV; ++k)
    twoProduct (a[k], b[k], &s[k], &r[k]);
  for (outer_k = NV; outer_k + NV <= N; outer_k += NV) {
    const double * restrict av = &a[outer_k];
    const double * restrict bv = &b[outer_k];
#pragma omp simd
    for (size_t k = 0; k < NV; ++k) {
      double p, t, tt;
      twoProduct (av[k], bv[k], &p, &t);
      twoSum (s[k], p, &s[k], &tt);
      r[k] += tt;
      twoSum (s[k], t, &s[k], &tt);
      r[k] += tt;
    }
  }
  if (outer_k != N) {
    /* Ragged remainder when N is not a multiple of NV. */
    const double * restrict av = &a[outer_k];
    const double * restrict bv = &b[outer_k];
    for (size_t k = 0; k < N - outer_k; ++k) {
      double p, t, tt;
      twoProduct (av[k], bv[k], &p, &t);
      twoSum (s[k], p, &s[k], &tt);
      r[k] += tt;
      twoSum (s[k], t, &s[k], &tt);
      r[k] += tt;
    }
  }
  for (size_t k = 1; k < NV; ++k) {
    double t;
    twoSum (s[0], s[k], &s[0], &t);
    r[0] += t;
    r[0] += r[k];
  }
 out:
  if (head_out && tail_out)
    quickTwoSum (s[0], r[0], head_out, tail_out);
  return s[0] + r[0];
}

static void print_excepts (int);

#define TEST(x) do {                            \
    feclearexcept (FE_ALL_EXCEPT);              \
    x;                                          \
    raised = fetestexcept (FE_ALL_EXCEPT);      \
  } while (0)

int
main (void)
{
#pragma STDC FENV_ACCESS on
  double head, tail;
  double head_diff, tail_diff;
  double a = DBL_MAX, b = DBL_MAX;
  int raised;

  TEST(twoSum (a, b, &head, &tail));
  printf ("twoSum (%g, %g) = (%g, %g)\n", a, b, head, tail);
  print_excepts (raised);
  TEST(quickTwoSum (a, b, &head, &tail));
  printf ("quickTwoSum (%g, %g) = (%g, %g)\n", a, b, head, tail);
  print_excepts (raised);
  TEST(twoProduct (a, b, &head, &tail));
  printf ("twoProduct (%g, %g) = (%g, %g)\n", a, b, head, tail);
  print_excepts (raised);

  /* Thanks to Sylvie Boldo, Stef Graillat, Jean-Michel Muller for this
     annoying example. */
  a = ldexp (-3.0, DBL_MAX_EXP-DBL_MANT_DIG-1);
  TEST(twoSum (a, b, &head, &tail));
  printf ("\ntwoSum (%g, %g) = (%g, %g)\n", a, b, head, tail);
  print_excepts (raised);
  TEST(twoSum (b, a, &head, &tail));
  printf ("twoSum (%g, %g) = (%g, %g)\n", b, a, head, tail);
  print_excepts (raised);
  TEST(twoSum (-a, -b, &head, &tail));
  printf ("twoSum (%g, %g) = (%g, %g)\n", -a, -b, head, tail);
  print_excepts (raised);
  TEST(twoSum (-b, -a, &head, &tail));
  printf ("twoSum (%g, %g) = (%g, %g)\n", -b, -a, head, tail);
  print_excepts (raised);

  twoSum (a, -b, &head, &tail);
  twoDifference (a, b, &head_diff, &tail_diff);
  if (head != head_diff || tail != tail_diff)
    printf ("twoSum (%g, %g) != twoDifference (%g, %g)\n", a, -b, a, b);
  quickTwoSum (a, -b, &head, &tail);
  quickTwoDifference (a, b, &head_diff, &tail_diff);
  if (head != head_diff || tail != tail_diff)
    printf ("quickTwoSum (%g, %g) != quickTwoDifference (%g, %g)\n",
            a, -b, a, b);

  TEST(quickTwoSum (b, a, &head, &tail));
  printf ("\nquickTwoSum (%g, %g) = (%g, %g)\n", b, a, head, tail);
  print_excepts (raised);
  TEST(quickTwoSum (-b, -a, &head, &tail));
  printf ("quickTwoSum (%g, %g) = (%g, %g)\n", -b, -a, head, tail);
  print_excepts (raised);
  TEST(twoProduct (a, b, &head, &tail));
  printf ("\ntwoProduct (%g, %g) = (%g, %g)\n", a, b, head, tail);
  print_excepts (raised);

  a = DBL_MIN;
  b = -nextafter (DBL_MIN, HUGE_VAL);
  TEST(twoSum (a, b, &head, &tail));
  printf ("\ntwoSum (%g, %g) = (%g, %g)\n", a, b, head, tail);
  print_excepts (raised);
  TEST(quickTwoSum (a, b, &head, &tail));
  printf ("quickTwoSum (%g, %g) = (%g, %g)\n", a, b, head, tail);
  print_excepts (raised);

  const double zero1[] = { 0.0, -0.0, 0.0, -0.0 };
  const double zero2[] = { 0.0, 0.0, -0.0, -0.0 };
  const int NZ = sizeof (zero1) / sizeof (*zero1);
  for (int k = 0; k < NZ; ++k) {
    TEST(twoSum (zero1[k], zero2[k], &head, &tail));
    printf ("\ntwoSum (%g, %g) = (%g, %g)\n", zero1[k], zero2[k], head, tail);
    print_excepts (raised);
    TEST(quickTwoSum (zero1[k], zero2[k], &head, &tail));
    printf ("quickTwoSum (%g, %g) = (%g, %g)\n", zero1[k], zero2[k], head, tail);
    print_excepts (raised);
    TEST(twoProduct (zero1[k], zero2[k], &head, &tail));
    printf ("twoProduct (%g, %g) = (%g, %g)\n", zero1[k], zero2[k], head, tail);
    print_excepts (raised);
  }

  double v[] = { DBL_MAX, DBL_MAX, DBL_MAX, DBL_MAX,
                 DBL_MAX, DBL_MAX, DBL_MAX, DBL_MAX };
  double out;
  const int N = sizeof (v) / sizeof (*v);
  TEST(out = widesum (N, v));
  printf ("\nwidesum (v) = %g\n", out);
  print_excepts (raised);
  TEST(out = widedot (N, v, v, NULL, NULL));
  printf ("widedot (v, v) = %g\n", out);
  print_excepts (raised);

  /* The next operation cannot be an exact transformation in the same
     format. */
  TEST(twoProduct (DBL_MIN, DBL_MIN, &head, &tail));
  printf ("\ntwoProduct (%g, %g) = (%g, %g)\n", DBL_MIN, DBL_MIN, head, tail);
  print_excepts (raised);
  /* Inexact must not be raised, underflow may. */
  TEST(twoProduct (DBL_MIN, 0.5, &head, &tail));
  printf ("\ntwoProduct (%g, %g) = (%g, %g)\n", DBL_MIN, 0.5, head, tail);
  print_excepts (raised);
  /* Assuming gradual underflow so a is not DBL_MIN, this underflows and
     is inexact. */
  a = DBL_MIN + nextafter (0.0, HUGE_VAL);
  TEST(twoProduct (a, 0.5, &head, &tail));
  printf ("\ntwoProduct (%g, %g) = (%g, %g) ... %g\n", a, 0.5, head, tail,
          a - DBL_MIN);
  print_excepts (raised);

  /* The opposite-signed zero tail in the following case is aggravating. */
  a = nextafter (ldexp (1.0, DBL_MIN_EXP), -HUGE_VAL);
  b = 0.125;
  TEST(twoProduct (a, b, &head, &tail));
  printf ("twoProduct (%g, %g) = (%g, %g) ... %g\n", a, b, head, tail,
          a - head/b);
  print_excepts (raised);
  b = 0.5;
  TEST(twoProduct (a, b, &head, &tail));
  printf ("twoProduct (%g, %g) = (%g, %g) ... %g\n", a, b, head, tail,
          a - head/b);
  print_excepts (raised);
  b = -b;
  TEST(twoProduct (a, b, &head, &tail));
  printf ("twoProduct (%g, %g) = (%g, %g) ... %g\n", a, b, head, tail,
          a - head/b);
  print_excepts (raised);

  double v2[N];
  for (int k = 0; k < N; ++k) {
    v[k] = a;
    v2[k] = b;
  }
  double tmp_head = NAN, tmp_tail = NAN;
  TEST(widedot (2, v, v2, &tmp_head, &tmp_tail));
  /* Cilk+ slice notation is [start:length]. */
  printf ("\nwidedot (v[0:2], v2[0:2]) = (%g, %g)\n", tmp_head, tmp_tail);
  print_excepts (raised);
}

static void
print_excepts (int raised)
{
  int anyout = 0;
  if (raised & FE_INVALID) anyout = printf (" invalid");
  if (raised & FE_UNDERFLOW) anyout = printf (" underflow");
  if (raised & FE_OVERFLOW) anyout = printf (" overflow");
  if (raised & FE_INEXACT) anyout = printf (" inexact");
  if (anyout) printf ("\n");
}
```

For the initial primitives, sufficiently recent gcc versions vectorize the loops marked `omp simd` with optimization option `-O3` on at least an Intel Skylake-based platform. The careful primitives in `dd-careful-primitives.h` cannot be vectorized so easily.

An eyeball analysis makes me believe that the summation in `dd-example.c` obeys a slightly weaker bound than Sum2s in Ogita, Rump, & Oishi: |S| ε + (N + NV) ε² rather than |S| ε + N ε², where S is the true sum. If multi-threaded with NT threads and using the same naïve reduction, increase N + NV to N + NV + NT. This no doubt can be improved with some thought.

## Footnotes:

^{1} Priest, D. M. (1991, June). "Algorithms for arbitrary precision floating point arithmetic." In Proceedings of the 10th IEEE Symposium on Computer Arithmetic (pp. 132-143). doi:10.1109/ARITH.1991.145549

^{2} Boldo, S., Graillat, S., and Muller, J.-M. "On the robustness of the 2Sum and Fast2Sum algorithms." https://hal-ens-lyon.archives-ouvertes.fr/ensl-01310023

^{3} Ahrens, P., Demmel, J., and Nguyen, H. D. "Exploring ways to support reproducible summation in a future IEEE 754 Standard." Distributed by email.

^{4} Dukhan, M., Vuduc, R., and Riedy, J. "Wanted: Floating-Point Add Round-off Error Instruction." arXiv:1603.00491

^{5} Yamazaki, I., Tomov, S., and Dongarra, J. "Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs." SIAM J. Sci. Comput. 37(3): C307-C330 (2015). doi:10.1137/14M0973773