

## Monotonicity of Multi-Term Floating-Point Adders

#### Mantas Mikaitis

#### School of Computing, University of Leeds, Leeds, UK

#### AriC Team Seminar, Laboratoire LIP Lyon, France, Apr. 11, 2024 Slides: https://mmikaitis.github.io/talks



Summation over real values

$$f(x_1, x_2, ..., x_n) = \sum_{i=1}^n x_i,$$

is monotonic because for any  $x_i < x_i^*$  we have that

$$f(x_1,...,x_n) \leq f(x_1^*,...,x_n^*).$$

#### Monotonicity

With multivariate monotonic functions, if one or more of the arguments is increased, the function also increases or stays constant.

When computing summation in floating-point arithmetic we would like to preserve monotonicity of the sum.

Some of the known properties of floating point:

- $\checkmark$  Commutativity:  $fl(a \times b) = fl(b \times a)$ .
- × Associativity:  $fl(a + fl(b + c)) \neq fl(fl(a + b) + c)$ .
- × Distributivity:  $fl(a \times fl(b + c)) \neq fl(fl(a \times b) + fl(a \times c))$ .
- ? Monotonicity

#### By the end of this talk...

learn about the monotonicity of floating-point sum and how it is affected by the mathematical hardware computing it.

# Addition of multiple numbers in floating point

We are used to adding numbers in pairs, using a two-term FP adders.

IEEE 754 adder computes as though in infinite precision, then normalizes and rounds.

Multi-term adders

Latest hardware includes specialized **multi-term adders** alongside the standard **two-term addition**.



## Standard model [Higham, 2002]

Floating-point addition defined with

$$fl(x + y) = (x + y)(1 + \delta), \quad |\delta| \le 2^{-p}$$

where fl() refers to normalizing and rounding x + y to form a floating-point value defined by IEEE 754.

Following classes of multi-term adders are present in current hardware literature:

- Class I (exact "Kulisch" accumulator) and Class II (compute sticky bits correctly [Tenca, 2009]): fl(x<sub>1</sub> + x<sub>2</sub> + ... + x<sub>n</sub>).
- Class III (chain of two-term adders):  $fl(fl(\cdots fl(x_1 + x_2) + \cdots) + x_n).$

- Rounding fl(x) is monotonic by definition.
- Addition fl(x + y) is monotonic.
- Summation fl(fl(···fl(x<sub>1</sub> + x<sub>2</sub>) + ···) + x<sub>n</sub>) is monotonic for any n (Class III).
- Multiplication  $fl(x \times y)$  is monotonic.
- Inner product fl(···fl(fl(a<sub>1</sub> × b<sub>1</sub>) + fl(a<sub>2</sub> × b<sub>2</sub>)) + ··· + fl(a<sub>n</sub> × b<sub>n</sub>)) is monotonic (Class III).
- Fused computations  $fl(x_1 + x_n + \cdots + x_n)$  are monotonic (**Class I/II**).

#### Proofs

The proofs of these come down to the monotonicity of rounding and work with the main IEEE 754 rounding modes: **round-to-nearest**, **round-towards-zero**, **round-up**, and **round-down**.

But there is a catch: changing order can break monotonicity because in general  $fl(a + fl(b + c)) \neq fl(fl(a + b) + c)$ .

Example in binary32: a = 1, fl(fl(fl( $a + 2^{-24}) + 2^{-24}$ ) +  $2^{-24}$ ) +  $2^{-24}$ ) = 1.

Now **decrease** *a* to  $a = 1 - 2^{-24}$  and change the order of evaluation:  $fl(fl(fl(2^{-24} + 2^{-24}) + 2^{-24}) + 2^{-24}) + a) = 1 + 2^{-22}$ .

Community knows this behaviour well and expects this.

But if order is unchanged this is not expected to happen.

- A binary floating-point number x has the form  $(-1)^s imes m imes 2^{e-p+1}$
- s is the sign bit, p is the precision,  $m \in [0, 2^p 1]$  is the integer significand, and  $e \in [e_{\min}, e_{\max}]$ , with  $e_{\min} = 1 e_{\max}$ , is the integer exponent.
- The number system is *normalized* so that the most significant bit of m is always set to 1 if  $|x| \ge 2^{e_{\min}}$ .
- Floating-point numbers with  $2^{p-1} \le m \le 2^p 1$  are normalized.

#### Normalization

The result of an operation must be normalized by **shifting the significand** left or right until it falls within the interval  $[2^{p-1}, 2^p - 1]$  and **adjusting the exponent** accordingly.

#### Modified standard model

We define an adder that starts with some precision p and can grow precision when carries occur in floating-point significand addition.

$$\ln(a+b) = \begin{cases} fl_p(a+b) & \text{if } |a+b| < t, \\ fl_{p+1}(a+b) & \text{if } |a+b| \ge t, \end{cases}$$
(1)

with  $|a| \ge |b|$ ,  $t = 2^{1 + \lfloor \log_2 |a| \rfloor}$ , power of two nearest to |a| with |a| < |t|.

- When this adder is joined to compute expressions such as  $flr(flr(x_1 + x_2) + x_3)$ , precision increases propagate.
- Sum with *n* additions can grow precision from *p* to  $p + \lceil log_2n \rceil$ .
- Similar device used by [Ashenhurst and Metropolis, 1959] for error analysis.

# Class IV multi-term adders

**Algorithm 0:** Given numbers  $\{x_1, ..., x_n\}$ , with exponents  $\{e_1, ..., e_n\}$  and significands  $\{1.m_1, ..., 1.m_n\}$  approximate  $s_n = \sum_{i=1}^n x_i$ .

Determine  $e_{max} = \max(e_1, ..., e_n)$ Align all  $1.m_i$  by shifting each  $e_{max} - e_i$  steps right Perform addition of aligned significands Perform normalization and rounding to form  $s_n$ 

We can partially model **Class IV** adders:  $fl(flr(\cdots flr(x_1 + x_2) + \cdots) + x_n)$ .

Hardware properties:

- start in limited precision accumulator *p*,
- do not normalize and round until the computation is finished,
- due to carries, grow effective precision.

Note flr() does not account for lack of normalization after cancellation, where precision loss occurs. Not addressed in this work.

# Monotonicity: example on current hardware

Originally observed in [Fasi, Higham, Pranesh, Mikaitis, 2021].

Recent GPUs contain hardware for  $D = A \times B + C$  where  $A \in \mathbb{R}^{8 \times 8}$  and  $B \in \mathbb{R}^{8 \times 4}$  are binary16 matrices,  $C, D \in \mathbb{R}^{8 \times 4}$  are binary32 matrices.

We will focus on two result elements in D:

• 
$$d_{11} = a_{11}b_{11} + a_{12}b_{21} + \dots + a_{18}b_{81} + c_{11}$$

• 
$$d_{12} = a_{11}b_{12} + a_{12}b_{22} + \cdots + a_{18}b_{82} + c_{12}$$

We set A, B = 1 (matrices of ones) and  $c_{11} = 33554430$  and  $c_{12} = 33554432$ .

Computing  $A \times B + C$  with a GPU returns a matrix that has  $d_{11} = 33554436$  and  $d_{12} = 33554432$ .

Since  $c_{11} < c_{12}$  but  $d_{11} > d_{12}$  the 9-term sum is nonmonotonic.

# Commercial devices

| Year      | Device/Architecture | Input formats                                                   | Output formats     | Terms | Predicted class |
|-----------|---------------------|-----------------------------------------------------------------|--------------------|-------|-----------------|
| 2016      | Google TPUv2        | bfloat16                                                        | binary32           | -     | Class III       |
| 2017      | Google TPUv3        | bfloat16                                                        | binary32           | -     | Class III       |
| 2018      | NVIDIA V100         | binary16                                                        | binary32           | 5     | Class IV        |
| 2018      | Graphcore IPU1      | binary16                                                        | binary32           | -     | -               |
| 2020      | Google TPUv4i       | bfloat16                                                        | binary32           | 4     | Class IV        |
| 2020      | Graphcore IPU2      | binary16                                                        | binary32           | -     | -               |
| 2020      | NVIDIA A100         | bfloat16,<br>binary16,<br>binary64,<br>TensorFloat-32           | binary32/64        | 9     | Class IV        |
| 2021      | AMD MI250X          | bfloat16,<br>binary16,<br>binary32,<br>binary64                 | -                  | 5     | -               |
| 2021      | GrogChip            | binary16                                                        | binary32           | 160   | Class I or II   |
| 2022      | NVIDIA H100         | 8-bit,<br>bfloat16,<br>binary16,<br>binary64,<br>TensorFloat-32 | binary32, binary64 | 17    | -               |
| 2022      | Intel Ponte Vecchio | bfloat16,<br>binary16,<br>binary64,<br>TensorFloat-32           | -                  | -     | -               |
| 2016-2022 | Intel AMX           | binary16                                                        | binary32           | 17    | Class III       |
| 2023      | Tesla Dojo          | CFP8, bfloat16                                                  | binary32           | 8     | Class IV        |
| 2024      | NVIDIA Blackwell    | block 4/6/8-bit,<br>[]                                          | binary32, binary64 | -     | -               |

- Addition flr(x + y) is monotonic with round-to-nearest, round-towards-zero, round-up, and round-down.
- Addition of three operands flr(flr(x<sub>1</sub> + x<sub>2</sub>) + x<sub>3</sub>) is non-monotonic with round-to-nearest, round-toward-zero and round-down, except if rounded to starting precision fl(flr(flr(x<sub>1</sub> + x<sub>2</sub>) + x<sub>3</sub>)).

Proof:

- Three consecutive positive FP numbers a, b (a power of 2), c.
- a < b < c,  $\varepsilon = \frac{c-b}{2}$ .
- $\operatorname{flr}(b + \varepsilon) = b$  and  $\operatorname{flr}(\operatorname{flr}(b + \varepsilon) + \varepsilon) = b$  with RN, RD, RZ.
- $flr(a + \varepsilon) = b$  (precision grows).
- $flr(flr(a + \varepsilon) + \varepsilon) > b$  due to precision growth in the first add.
- $fl(flr(a+\varepsilon)+\varepsilon))$ : OK.

#### Main result

Summation  $\operatorname{flr}(\cdots \operatorname{flr}(x_1 + x_2) + \cdots) + x_n)$ , with  $x_i \in \mathbb{R}$  and  $n \ge 4$  is not monotonic with round-to-nearest, round-towards-zero, and round-toward-negative  $(x_i > 0)$  or round-toward-positive  $(x_i < 0)$ , with and without the final rounding to the starting precision.

# Proof

- Three consecutive positive FP numbers a, b (a power of 2), c.
- a < b < c,  $\varepsilon = \frac{c-b}{2}$ .
- With RN,  $\operatorname{flr}(b + \varepsilon) = b$  for  $\varepsilon \leq (c b)/2$ , while in precision-(p + 1) arithmetic  $\operatorname{flr}(b + \varepsilon) = b$  for  $\varepsilon \leq (c b)/4$ .
- Also, in precision-p arithmetic flr(a + (c b)/2) = b.

Consider  $\operatorname{flr}(\operatorname{flr}(x + \varepsilon) + \varepsilon) + \varepsilon)$  in two cases:

- x = b, then flr(flr( $b + \varepsilon$ ) +  $\varepsilon$ ) +  $\varepsilon$ ) = b (all in precision-p).
- x = a, then the first addition flr(a + ε) = b (and precision increases to p + 1 since b is a power of two).
  - the second addition  $flr(b + \varepsilon) = b + \varepsilon$  (in precision p + 1);
  - the third addition  $flr(b + \varepsilon + \varepsilon) = c$  (in precision p + 1).

When x = b, sum evaluates to b, but when x = a < b, sum evaluates to c > b. Final rounding to p does not change the results.  $\Box$ 

- Various packages available: chop, FLOATP, QPyTorch.
- Usual approach is to perform ops in binary32/64 HW.
- Round down to sub-32-bit precision: careful with double rounding.
- We believe ours is most customizable and fastest: CPFloat [Fasi & Mikaitis, 2023].
- Can be used in MATLAB, Octave or C.

# Example with CPFloat in MATLAB/Octave

```
>> options.format = 'binary16';
>> [~,options] = cpfloat(0, options)
options =
  struct with fields:
       format: 'binary16'
       params: [11 15]
    subnormal: 1
        round: 5
         flip: 0
            p: 0.5000
       explim: 1
>> cpfloat(pi, options)
ans =
    3.1406
>> options.params(1) = options.params(1) + 1;
>> cpfloat(pi, options)
ans =
    3.1426
```

# Numerical experiments

• We simulated Class IV multi-term adders with MATLAB, using the custom precision simulator CPFloat [Fasi and Mikaitis, 2023].

Compute

$$\mathrm{fl}(\cdots \mathrm{fl}(x_1+x_2)+\cdots)+x_n)$$

and

$$fl(flr(\cdots flr(x_1 + x_2) + \cdots) + x_n))$$

in three small floating-point systems: p = 3,  $e_{max} = 3$ ; p = 4,  $e_{max} = 3$ ; and p = 5,  $e_{max} = 4$ .

- Set all x<sub>i</sub> = 0.25 and then vary x<sub>1</sub> by changing it to the adjacent floating-point value towards +∞ until all representable values are covered.
- Each time we change x<sub>1</sub> we sum the values x<sub>i</sub> with IEEE 754 ops and with the Class IV adder.
- Relative error compared with the same sum performed in binary64 arithmetic.

# Numerical experiments with p = 3, $e_{max} = 3$



## Numerical experiments with p = 4, $e_{max} = 3$



# Numerical experiments with p = 5, $e_{max} = 4$



# Impact of addend ordering on accuracy (binary16)



We may get into trouble if we use multi-term adders in computing  $\sqrt{\sum_{i=1}^{n} a_i - \sum_{j=1}^{k} b_j}$  when we know  $\sum_{i=1}^{n} a_i \ge \sum_{j=1}^{k} b_j$ .

For example in binary32, with a = [1, 1, 1, 1, 1, 1, 1, 16777216], b = [1, 1, 1, 1, 1, 1, 16777214]

- With a 8-term Class IV adder we get  $\sqrt{-4} = \text{NaN}$ .
- With IEEE 754 addition we can sort and avoid the NaN.

Compute interval of sum through the change of rounding modes (RD and RU) in Class IV multi-term adder.

Take a = [16777216, 1, 1, 1, 1, 1, 1], b = [16777214, 1, 1, 1, 1, 1, 1]

Interval of the sum of *a* is [16777216, 16777230].

Interval of the sum of b is [16777220, 16777222].

Decreasing one addend, the lower end of interval shifts up; interval narrows due to precision growth.

- Class I/II (Kulisch, Tenca): associative, monotonic.
- **Class III** (chain of IEEE 754 two-term adders): not associative, monotonic.
- **Class IV** (limited precision, no intermediate normalization): associative, not monotonic

- Monotonicity important in bisection [Demmel, Dhilon, Ren, 1995]; solving quadratic equations [Higham, 2002]; mathematical functions.
- IEEE 754-2019 recommends *reduction operations*, but does not specify details; may consider revisiting for IEEE 754-2029.

#### Paper

M. Mikaitis. *Monotonicity of Multi-Term Floating-Point Adders*. IEEE Trans. Comput. Feb. 2024. Early view.

# Leeds Mathematical Software and Hardware Lab

#### New informal group in the School of Computing, Univ. Leeds.



Massimiliano Fasi Lecturer Research&Teaching



Mantas Mikaitis Lecturer Research&Teaching



- Focusing on computer arithmetic, numerical linear algebra, high-performance computing.
- Working with IEEE P3109 and IEEE 754-2029.
- Serving on PC committees of ARITH.
- Planning MSc module on computer arithmetic.
- PhD studentships available.

#### Professor Nicholas J. Higham (1961–2024).

I am grateful to Nicolas Brunie, Max Fasi, and three anonymous referees of IEEE TC for comments and suggestions.

#### N. J. Higham

Accuracy and Stability of Numerical Algorithms. 2nd edition SIAM. 2002

# A. F. Tenca

Multi-operand floating-point addition 19th IEEE Symp. Comput. Arithmetic. 2009

R. L. Ashenhurst and N. Metropoli Unnormalized floating point arithmetic J. ACM, 6. 1959.



M. Fasi, N. J. Higham, S. Pranesh, and M. Mikaitis Numerical behavior of NVIDIA tensor cores PeerJ Comput. Sci., 7. 2021

### M. Fasi and M. Mikaitis CPFloat: A C library for simulating low-precision arithmetic ACM Trans. Math. Software, 49. 2023

#### J. W. Demmel, I. Dhillon, and H. Ren

On the correctness of some bisection-like parallel eigenvalue algorithms in floating point arithmetic Electron. Trans. Numer. Anal., 3. 1995