# Basic Issues in Floating Point Arithmetic and Error Analysis

## Introduction and Motivation

This lecture is a quick survey of issues in floating point arithmetic relevant to numerical linear algebra, including parallel numerical linear algebra. More detailed material on floating point may be found in Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic. by W. Kahan.

The IEEE floating point standards prescribe precisely how floating point numbers should be represented, and the results of all operations on floating point numbers. There are two standards: IEEE 754 is for binary arithmetic, and IEEE 854 covers decimal arithmetic as well. We will only discuss IEEE 754, which has been adopted almost universally by computer manufacturers. Indeed, except for older, obsolete architectures (such as IBM 370 and VAX), and the Cray XMP, YMP, 2, C90 and first generation T90, all other machines, including PCs, workstations, and parallel machines built from the same microprocessors, use IEEE arithmetic. The next Cray T90 is supposed to implement IEEE arithmetic.

So the good news is that essentially all new computers will be using the same floating point arithmetic in the near future, and almost all do now, with the important exception of Cray. This simplifies code development, especially portable code development, as well as some important rounding error analyses, as we will illustrate below.

Unfortunately, not all manufacturers implement every detail of IEEE arithmetic the same way. Everyone does indeed represent numbers with the same bit patterns, and rounds results correctly (or tries to!). But exceptional conditions (like 1/0, sqrt(-1) etc.), whose semantics are also specified in detail by the IEEE standard, are not always handled the same way. It turns out that many manufacturers believe (sometimes rightly and sometimes wrongly) that conforming to every detail of IEEE arithmetic would make their machines either a bit slower or a bit more expensive, enough so make them less attractive in the marketplace. In particular, machines are often sold on the basis of how fast they execute certain standard benchmarks, like the Linpack Benchmark, which do not generate exceptions, so exception handling often gets short shrift. We will discuss the impact of this on code development.

## Basic Formats and Operations

The IEEE standard 754 for binary arithmetic specifies 4 floating point formats, single, single extended, double, and double extended. PCs based on Intel x87 chips support single, double and double extended, and most other implementations support only single and double. Single extended is very rare. Some machines support quadruple in software, which is a particular instantiation of double extended, as described below. The following table describes the range of normalized numbers representable in the various formats. Floating point numbers are represented in the form +-significand * 2^(exponent), where the significand is a nonnegative number. A normalized significand lies in the half open interval [1,2), i.e. it has no leading zero bits.
```                              Floating point formats

Name of format   Bytes to  Number of bits       Macheps      Number of        Approx range
store     in significand                    exponent bits
--------------   --------  --------------       -------      -------------    ------------
Single              4            24          2^(-24) ~ 1e-7        8            10^(+-38)
Double              8            53          2^(-53) ~ 1e-16      11          >=10^(+-308)
Double extended  >=10          >=64        <=2^(-64) ~ 1e-19    >=15          >=10^(+-4932)
Quadruple          16           113          2^(-113)~ 1e-34      15            10^(+-4932)
```
Macheps is short for "machine epsilon", and is used below for roundoff error analysis. The distance between 1 and the next larger floating point number is 2*macheps.

When the exponent has neither its largest possible value (a string of all ones) nor its smallest possible value (a string of all zeros), then the significand is necessarily normalized, and lies in [1,2).

When the exponent has its largest possible value (all ones), the floating point number is either +infinity, -infinity, or NAN (not-a-number); we discuss these further below under exceptions. The largest finite floating point number is called the overflow threshold.

When the exponent has its smallest possible value (all zeros), the significand may have leading zeros, and is called either subnormal or denormal (unless it is exactly zero). The subnormal floating point numbers represent very tiny numbers between the smallest nonzero normalized floating point number (the underflow threshold) and zero. An operation that underflows yields a subnormal number or possibly zero; this is called gradual underflow. The alternative, simply returning a zero, is called flush to zero. When the significand is zero, the floating point number is +-0.

The basic operations specified by IEEE arithmetic are first and foremost addition, subtraction, multiplication and division. Square roots and remainders are also included. The default rounding for these operations is "to nearest even". This means that the floating point result fl(a op b) of the exact operation (a op b) is the nearest floating point number to (a op b), breaking ties by rounding to the floating point number whose bottom bit is zero (the "even" one). It is also possible to round up, round down, or truncate (round towards zero). Rounding up and down are useful for interval arithmetic, which can provide guaranteed error bounds; unfortunately most languages and/or compilers provide no access to the status flag which can select the rounding direction.

Other operations include binary-decimal conversion, floating point-integer conversions, and conversions among floating point formats, and comparisons.

## Basic Error Analysis

The basic formula for error analysis is
```(*)   fl(a op b) = (1+ eps)*(a op b)    provided fl(a op b) neither overflows nor underflows
```
where
```      (a op b) is the exact result of the operation, where op is one of +, -, * and /
fl(a op b) is the floating point result
| eps | <= macheps
```
To incorporate underflow, we may write
```   fl(a op b) = (1+ eps)*(a op b) + eta    provided fl(a op b) does not overflow
```
where
```       | eta | <= macheps * underflow_threshold
```
On machines which do not implement gradual underflow (including some IEEE machines which have an option to perform flush-to-zero arithmetic instead), the corresponding error formula is
```       | eta | <= underflow_threshold
```
i.e. the error is 1/macheps times larger.

### Error analysis on the Cray

We mentioned above that the Cray does not implement IEEE arithmetic. On the Cray the error analysis formula for addition and subtraction differs from the one above:
```(**)  fl(a +- b) = ( (1+eps1)*a ) +- ( (1+eps2)*b )    provided fl(a +- b ) neither overflows
nor underflows
```
In particular, this means that if a and b are nearly equal, so a-b is much smaller than either a or b (this is called extreme cancellation), then the relative error may be quite large. For example when subtracting 1-x, where x is the next smaller floating point number than 1, the answer is twice too large on a Cray C90, and 0 when it should be nonzero on a Cray 2.

This happens because when y's significand is shifted right to line its binary point up with 1's binary point, prior to actually subtracting the significands, any bits shifted past the least significant bit of 1 are simply thrown away (differently on a C90 and Cray 2), rather than participating in the subtraction. On IEEE machines, there is a so-called guard digit (actually 3 of them) to store any bits shifted this way, so they can participate in the subtraction.

This source of error has little effect on most problems in numerical linear algebra, but there are some important exceptions, as illustrated below.

## Numerical Stability and Error Analysis

Here is the standard paradigm for analyzing the error in an algorithm. For simplicity, we let f(x) be an exact mathematical function we wish to compute, where both x and f(x) are single real numbers. We let alg(x) be our algorithm, including all errors from roundoff, truncation error, etc. We wish to bound the error
```        error =  alg(x) - f(x)
```
Rather than trying to compute the error directly, we proceed as follows.

We first try to show that alg(x) is numerically stable, that is alg(x) is nearly the exact result f(x+e) of a slightly perturbed problem x+e:

```        alg(x) ~ f(x+e)   where e is small
```

Second, assuming f(x) is a smooth function at x, we approximate f(x) by a first-order Taylor expansion near x:

```        f(x+e) ~ f(x) + e*f'(x)
```

Combining the last two expressions yields

```         error = alg(x) - f(x)
~ f(x+e) - f(x)
~ f(x) + e*f'(x) - f(x)
= e*f'(x)
```
Thus, we have expressed the error as a product of a small quantity e and the derivative f'(x). e is usually called the backward error of the algorithm. f'(x), which depends only on f and not on the algorithm, is called the condition number. In other words, the error is approximately the product of the backward error and the condition number. This approach is ubiquitous in numerical linear algebra, where x and f(x) are usually vectors and/or matrices, and we use norms to bound the error:
```           || error || <~ || e || * || f'(x) ||
```
(The norm of a vector ||v|| or matrix ||M|| might simply be the root-sum-of-squares of its entries, for example.) Computing || f'(x) || exactly is often as expensive as solving the original problem. Therefore we often settle for an approximation; this is called condition estimation.

Roundoff error analysis is needed to show that e is small. We illustrate with a few examples.

## Example 1: evaluating log(1-x)/x

The function f(x) = log(1-x)/x is smooth for x <= 1. Its graph is shown below If we evaluate log(1-x)/x in IEEE floating point arithmetic using this formula (except for the special case f(0)=-1), the computed graph is shown below. Note that the answer is quite wrong near x=0, with a spike going from -2 to 0: Blowing up the graph near x=0 shows a sort of periodic behavior: Blowing the graph up even more yields the following picture. To hint at what is going on, we have made green tic marks at all x values such that 1-x is an exact floating point number. The red tic marks are exactly half way in between the green tic marks, and are the thresholds where fl(1-x) suddenly changes from one floating point value to the next. When x is tiny, log(1-x) is very close to -x. Said another way, when y is very close to 1, log(y) is very close to y-1. The next graph shows ((1-x)-1)/x, as computed in floating point. In exact arithmetic, this expression simplies to -1. But in floating point, it yields exactly the same picture as before: The next graph shows both (1-x)-1 and -x (both computed in floating point) on the same graph. Their quotient is what was shown in the last graph. It is easy to see why the quotient is far from -1: When x is tiny, fl(1-x) can only take on a few discrete values near 1, so fl(fl(1-x)-1) can in turn only take on a few discrete values, whereas x more nearly takes on a continuum of values. Here is a different algorithm for f(x) which computes it quite precisely in IEEE arithmetic:

```       y=1-x
if y == 1,
f = -1
else
f = log(y)/(1-y)
endif
```
In real arithmetic, the algorithm is clearly equivalent to f = log(1-x)/x (except at x=0, where it has the right limiting value). The graph of this function is shown below; it has no difficulty near x=0. This algorithm is backward stable, because it nearly computes log(1-(x+e))/(x+e), where e is small (on the order of of macheps); this makes the very reasonable assumption that the logarithm function is implemented accurately (this is not difficult). We invite the reader to supply a proof. (Hint: when x is small, there is no roundoff error committed when computing fl(1-y).)

Unfortunately, this improved algorithm is not backward stable on the Cray C90, as illustrated in the graph below, which only shows the graph for x near 0: The subsequent plot is a bigger blowup near x=0, with floating point numbers of the form 1-x marked as before. The following plots show intermediate results in the computation, both on an IEEE machine and on the Cray, for contrast. The final result on an IEEE machine is the quotient of the two values plotted in the two rightmost graphs; their quotient is clearly -1. The final result on a C90 is the quotient of the two values plotted in the two leftmost graphs; their quotient is clearly not -1 for x>0. The problem lies in the fact that fl(1-y) is off by a factor of up to 2 when y is near 1 on a C90, whereas fl(1-y) is exact in IEEE arithmetic. The cure on a C90 is to replace

```     y = 1-x
by
y = .5 + (.5 - x)
and
f = log(y)/(1-y)
by
f = log(y)/(.5+(.5-y))
```

The reader is invited to prove why this works, and to rejoice that the imminent universality of IEEE arithmetic will make such tricks unnecessary.

## Example 2: Finding Eigenvalues by Divide-and-Conquer

Currently the fastest algorithm for finding all eigenvalues and all eigenvectors of a symmetric matrix on a single processor cache-based machine is based on a divide-and-conquer approach originally proposed by Cuppen in 1981, and much refined since then. At its heart is an algorithm for inexpensively computing the eigenvalues and eigenvectors of a matrix in the special form D + r*z*z', where D is diagonal, r is a scalar, z is a vector, and z' is the transpose of z. We also call this the diagonal plus rank-one eigenproblem. The eigenvalues are found by applying a sophisticated version of Newton's method to find the zeros of a rational function which has the same zeros as the characteristic polynomial. Given the eigenvalues, the question is how to compute the eigenvectors. There is an easily confirmed formula for the eigenvector corresponding to a computed eigenvalue lambda, namely
```        x = ( D - lambda*I )^(-1) * z
```
Unfortunately, this formula is quite inaccurate when there are other eigenvalues very close to lambda. In particular, eigenvectors of different, but nearby, eigenvalues are not guaranteed to be orthogonal, which is essential for stability. This difficulty was attacked by Sorensen, Tang, Kahan and others, and proposed solutions were based on using double the input precision for certain critical parts of the calculation. When the input precision is already double, this would require quadruple, which is not widely available; this was the major stumbling block to widespread adoption of this algorithm for many years.

Finally, in 1992 Eisenstat and Gu nearly completely overcame this problem by using a different, stable formula, now incorporated in LAPACK routine sstedc (and slaed3, which sstedc calls). We refer the reader to the paper by Gu and Eisenstat cited below for details. Here we will just examine the kernel of their formula

```(*)       x(i) = prod_{j=1 to n except i}   q(i,j) / ( lambda(i) - lambda(j) )
```
where q(i,j) is some other (accurately) computed quantity. Stability depends on this formula being evaluated accurately.

With IEEE arithmetic we may apply error analysis formula (*) above. Assuming for simplicity that q(i,j) is exact, we see that the computed x(i) differs from the true x(i) by a factor f, where f is the product of 2n-1 different terms of the form 1+eps, |eps| <= macheps. Thus x(i) is computed to high relative accuracy as desired.

With Cray arithmetic, on the other hand, if some difference lambda(i) - lambda(j) undergoes extreme cancellation, it may not be computed accurately. If q(i,j) is also small, the factor q(i,j) / ( lambda(i) - lambda(j) ) may be O(1) and completely wrong, so that x(i) is completely wrong. This can indeed occur, and this temporarily prevented us from incorporating divide-and-conquer into LAPACK, until we figured out what to do.

The fix for this problem was proposed by W. Kahan, who pointed out that if no lambda(i) has a nonzero least significant bit, then when extreme cancellation occurs no information is lost by shifting. Thus, simply setting the bottommost bit of each lambda(i) to zero, a tiny error, would fix the problem. Doing this in a portable way, that would run whether compiled on a Cray, PC or any other machine, required the following trick: It turns out that computing

```       lambda = ( lambda + lambda ) - lambda
```
has no effect on lambda on any binary machine which rounds in any reasonable way (barring overflow). But on a Cray, it makes the bottom bit of lambda zero; the reader is invited to show why.

This still does not explain the ultimate appearance of the LAPACK code. It turns out that many aggressive optimizing compilers will examine the expression above, and simply throw it away, because in exact arithmetic, it obviously would have no effect. To try to prevent this, (lambda+lambda) is actually computed by a function call, where the function is compiled at a lower level of optimization. This is still no guarantee, but we hope that by the time compilers become aggressive enough for this to be a problem, non-IEEE arithmetics will have disappeared.

As a historical note, for a while we considered implementing this algorithm in doubled precision which represents double precision quantities as a pair (upper,lower) of single precision quantities, and manipulates these pairs. The attraction is that if the input precision is already double, this simulates quadruple, letting us use the same algorithm in all precisions. Algorithms for this simulation are well known, but require a guard digit during addition and subtraction to be reasonably efficient. So the Cray's lack of a guard digit made this approach unattractive too.

## Example 3: Computing x/sqrt(x^2+y^2)

A colleague once wrote a Cray program containing the expression
```        z = arccos( x/sqrt(x^2+y^2) )
```
An enigmatic run-time error was generated, which he (and several company engineers) finally tracked down to the fact that the computed value of x/sqrt(x^2+y^2) exceeded 1, to which arccos() subsequently objected. This occurred on the Cray because of sloppy rounding in the multiplication, division, and square root functions. Barring over/underflow, it can not occur using IEEE arithmetic. The user is invited to find a proof.

## References for Error Analysis and Numerical Stability

• Cuppen's original paper was J. Cuppen, "A divide and conquer method for the symmetric tridiagonal eigenproblem" Numer. Math., v. 36, pp 177-195, 1981.
• It is based on "Rank-one modification of the symmetric eigenproblem", by J. Bunch, P. Nielsen and D. Sorensen, Numer. Math., v 31, pp 31-48, 1978.
• A parallel implementation of Cuppen's method appeared in "A fully parallel algorithm for the symmetric eigenproblem", by J. Dongarra and D. Sorensen, SIAM J. Sci. Stat. Comput., v. 8, n. 2, pp 139-154, 1987. It was occasionally unstable for the reasons cited above.
• An analysis of the instability and a proposed fix using double precision appeared in "On the Orthogonality of Eigenvectors Computed by Divide-and-Conquer Techniques", by D. Sorensen and P. Tang, SIAM J. Num. Anal., v. 28, n. 6, pp 1752-1775, 1991.
• The improved solution now in LAPACK appeared in "A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem", by M. Gu and S. Eisenstat, SIAM J. Mat. Anal. Appl., v. 13, n. 4, pp 1266-1276, 1994.
• ## Exception Handling

When the result of a floating point operation is not representable as a normalized floating point number, and exception occurs. Before the IEEE standard, machines responded to exceptions in diverse ways, most causing the program to abort, but some continuing to compute using some default result. Programs that purported to be both robust and portable therefore had to assume the worst, that exceptions were invariably fatal, and so use defensive programming techniques that tried to anticipate most possible sources of exceptions and account for them ahead of time. LAPACK is constructed this way, for example. The price is slow, complicated, and therefore harder to maintain software. For example, there is a 700 line subroutine (slatrs.f) designed to solve triangular systems of linear equations safely no matter how ill-conditioned or ill-scaled the problem; this routine is used in condition estimation.

A main contribution of IEEE arithmetic was to standardize on how exceptions are handled. There are five kinds of exceptions, as listed in the table below. The default response for all 5 types is to set a sticky status flag, and continue computing the the default result listed below. The status flag can be read by the user, and remains set until reset by the user (hence "sticky").

```      Exception Name       Cause                                  Default Result
--------------       -----                                  --------------

Overflow             Result too large to represent          Return +-Infinity
as a normalized number

Underflow            Result too small to represent          Return subnormal
as a nonzero normalized number         number or 0

Divide-by-zero       Computing x/0, where x is              Return +-infinity
finite and nonzero

Invalid              infinity-infinity, 0*infinity,         NAN
infinity/infinity, 0/0, sqrt(-1),
x rem 0, infinity rem y,
comparison with NAN,
impossible binary-decimal conversion

Inexact              A rounding error occurred              rounded result
```
Alternatively, the IEEE standard recommends, but does not mandate, an alternative response to an exception, namely trapping. This means interrupting execution and jumping to another piece of code, possibly written by the user. In other words, it does not merely mean jumping to the operating system, printing an error message, and stopping, although this is indeed what most systems do. In principle, one could branch to an entirely different algorithm to solve the problem, and continue computing. The default response and sticky flags also provide this capability, but the user must check for exceptions and branch explicitly.

This is the area where the most liberties are taken with implementations of IEEE arithmetic, since a complete implementation of all IEEE recommendations could impact programming languages, compilers, and operating systems. Also, interrupting precisely when an operation occurs is very difficult, when many operations are in partial state of execution in a pipelined execution unit; one either needs to have a complex enough system to back out of all the partially executed operations, or abandon many effective forms of pipelining.

## Writing faster numerical code using exception handling

Using either sticky flags or traps, it is possible to exploit the following simple paradigm to write fast, reliable software with IEEE arithmetic. It assumes we have two algorithms available for a problem, a fast, usually accurate, but occasionally unstable algorithm, and a slower, more reliable algorithm.
1. Use the fast algorithm to compute an answer; this will usually be accurate.
2. Quickly and reliably assess the accuracy of the computed answer.
3. In the unlikely event that the answer is not good enough, recompute it using the reliable algorithm.
The utility of this paradigm depends on there being a big difference in speed between the two algorithms, on being able to assess the accuracy quickly and reliably, and instabilities in the fast algorithm not causing the program to abort or run very slowly. This is where IEEE arithmetic comes in: we assume that instabilities, if they occur, result in a status flag being set (which we can check in step 2), along with either a trap, or infinities and NaNs propagating harmlessly (and quickly!) through the computation.

We illustrate this paradigm with two examples below. It is effective on many machines, except that on machines which implement arithmetic with infinities and NANs very slowly, the worst case running time (as opposed to the average), could be quite slow. See the paper by Demmel and Li in the references for details.

## Example 4: Condition Estimation

To compute an error bound for the solution of a linear system of equations Ax=b, we need an estimate of the condition number of A, ||A||*||inv(A)||, where ||.|| is some matrix norm. One may then approximately bound the error in the computed solution x by
```        || computed x - true x ||
-------------------------  <=  ||A|| * ||inv(A)|| * macheps
|| computed x ||
```
The problem is to estimate || inv(A) || without computing inv(A) explicitly, which would usually be more expensive than solving Ax=b in the first place. LAPACK and other packages use an optimization approach which attempt to find a unit vector z, such that || inv(A)*z || is as large as possible; this typically requires a handful of solutions of linear systems of the form A*y=z, for carefully chosen z vectors. If we have already done Gaussian elimination and factored the matrix A into the product A=L*U of triangular matrices, the extra work involved in condition estimation is usually negligible. This is implemented in LAPACK routine sgecon.f.

Thus, the inner loop of the algorithm requires the solution of triangular linear systems (with matrices L and U), where we expect the solution to be large. This is where exceptions come in. When A is nearly singular, so the computed solution is quite large, overflows are most likely. But if A is nearly singular, this is precisely when the user is most likely to want to do condition estimation. Therefore, we cannot simply call the existing, manufacturer-supplied, highly-optimized triangular system solver in the BLAS, because it may overflow. Therefore, in LAPACK we call our own complicated, slow and reliable alternative slatrs.f, which is careful to scale inside the inner loop to avoid possible exceptions.

Applying our three-point paradigm to this situation yields the algorithm

1. Do condition estimation using a highly-optimized but overflow-susceptible triangular system solver.
2. Check the exception status flags; if no exception occurred, fine.
3. Otherwise, recompute using slatrs, or simply signal that the matrix is highly ill-conditioned.
Thus, in this special case, we can even dispense with the slow reliable code altogether, since any very large condition number gives as much information to the user as any other ("your computed solution is completely unreliable").

The result is an algorithm which is 1.5 to 5 times faster than the old one on a Sun 4/260, 1.6 to 9.3 times faster on a DEC Alpha, and 2.55 to 4.21 times faster on a Cray C90. A Cray C90 does not have IEEE exception handling, but these speedups were obtained in the most common case where no exceptions occurred. Thus, for the Cray, it measures the speed penalty paid on Cray by not having exception handling.

## Example 5: Finding Eigenvalues Using Bisection

Bisection is a standard serial and (more importantly) parallel algorithm for finding eigenvalues of symmetric tridiagonal matrices. If T is an n-by-n tridiagonal matrix
```        [ a(1) b(1)                    ]
[ b(1) a(2) b(2)               ]
T =  [       ...  ...    ...        ]
[           b(n-1) a(n-1) b(n) ]
[                  b(n)   a(n) ]
```
and s is a shift, then barring exceptions the following simple algorithm counts the number of eigenvalues of T less than s:
```       count = 0
b(0) = 0
d = 1
for i = 1 to n
d = a(i) - s - b(i-1)^2/d
if (d<0) count = count + 1
end for
```
To make this code robust, we must take into account exceptions in the inner loop. For example, if s were an eigenvalue of the leading k-by-k submatrix of T, then (in exact arithmetic) the k-th value of d would be exactly zero. Assuming we prescale T so its largest entry is neither too large nor too small (making ||T|| equal to the fourth root of the overflow threshold is a good value), then the following code is robust across all machines, and is currently in LAPACK routine slaebz.f (called by sstebz):
```       count = 0
b(0) = 0
d = 1
pivmin = sqrt(underflow_threshold)
for i = 1 to n
d = a(i) - s - b(i-1)^2/d
if (abs(d) < pivmin) d = -pivmin      ... guarantees division by |d| not too small
if (d < 0) count = count + 1
end for
```
However, this runs about 14% to 47% slower than the simpler algorithm above. With IEEE arithmetic, it turns out that the first algorithm is perfectly correct even when d becomes zero. The result is that the next d will be +-infinity, and on the subsequent iteration -b(i-1)^2/d will be zero again. One can show that this yields accurate and reliable results; see the reference below.

## References for Exception Handling

In addition to Click here for Prof. Kahan's detailed "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic," see
• See "Faster Numerical Algorithms via Exception Handling", by J. Demmel and X. Li, IEEE Trans. Computers, v 43, n 8, Aug 1994, pp 983-992, for a more detailed discussion of speedups from exception handling. (An old version is available as LAPACK Working Note 59.)
• See "On the correctness of parallel bisection in floating point", J. Demmel, I. Dhillon, H. Ren, LAPACK Working Note 70, for a discussion of various versions of bisection.