The IEEE floating point standards prescribe precisely how floating point numbers should be represented, and the results of all operations on floating point numbers. There are two standards: IEEE 754 is for binary arithmetic, and IEEE 854 covers decimal arithmetic as well. We will only discuss IEEE 754, which has been adopted almost universally by computer manufacturers. Indeed, except for older, obsolete architectures (such as IBM 370 and VAX), and the Cray XMP, YMP, 2, C90 and first generation T90, all other machines, including PCs, workstations, and parallel machines built from the same microprocessors, use IEEE arithmetic. The next Cray T90 is supposed to implement IEEE arithmetic.

So the good news is that essentially all new computers will be using the same floating point arithmetic in the near future, and almost all do now, with the important exception of Cray. This simplifies code development, especially portable code development, as well as some important rounding error analyses, as we will illustrate below.

Unfortunately, not all manufacturers implement every detail of IEEE arithmetic the same way. Everyone does indeed represent numbers with the same bit patterns, and rounds results correctly (or tries to!). But exceptional conditions (like 1/0, sqrt(-1) etc.), whose semantics are also specified in detail by the IEEE standard, are not always handled the same way. It turns out that many manufacturers believe (sometimes rightly and sometimes wrongly) that conforming to every detail of IEEE arithmetic would make their machines either a bit slower or a bit more expensive, enough so make them less attractive in the marketplace. In particular, machines are often sold on the basis of how fast they execute certain standard benchmarks, like the Linpack Benchmark, which do not generate exceptions, so exception handling often gets short shrift. We will discuss the impact of this on code development.

Floating point formats Name of format Bytes to Number of bits Macheps Number of Approx range store in significand exponent bits -------------- -------- -------------- ------- ------------- ------------ Single 4 24 2^(-24) ~ 1e-7 8 10^(+-38) Double 8 53 2^(-53) ~ 1e-16 11 >=10^(+-308) Double extended >=10 >=64 <=2^(-64) ~ 1e-19 >=15 >=10^(+-4932) Quadruple 16 113 2^(-113)~ 1e-34 15 10^(+-4932)Macheps is short for "machine epsilon", and is used below for roundoff error analysis. The distance between 1 and the next larger floating point number is 2*macheps.

When the exponent has neither its largest possible value (a string of all ones) nor its smallest possible value (a string of all zeros), then the significand is necessarily normalized, and lies in [1,2).

When the exponent has its largest possible value (all ones), the floating point number
is either +infinity, -infinity, or NAN (not-a-number); we discuss
these further below under **exceptions**.
The largest finite floating point number is
called the *overflow threshold*.

When the exponent has its smallest possible value (all zeros), the significand may have
leading zeros, and is called either subnormal or denormal (unless it is exactly zero).
The subnormal floating point numbers represent very tiny numbers between the smallest
nonzero normalized floating point number (the *underflow threshold*) and zero. An operation
that underflows yields a subnormal number or possibly zero; this is called *gradual
underflow*. The alternative, simply returning a zero, is called *flush to zero*.
When the significand is zero, the floating point number is +-0.

The basic operations specified by IEEE arithmetic are first and foremost addition, subtraction, multiplication and division. Square roots and remainders are also included. The default rounding for these operations is "to nearest even". This means that the floating point result fl(a op b) of the exact operation (a op b) is the nearest floating point number to (a op b), breaking ties by rounding to the floating point number whose bottom bit is zero (the "even" one). It is also possible to round up, round down, or truncate (round towards zero). Rounding up and down are useful for interval arithmetic, which can provide guaranteed error bounds; unfortunately most languages and/or compilers provide no access to the status flag which can select the rounding direction.

Other operations include binary-decimal conversion, floating point-integer conversions, and conversions among floating point formats, and comparisons.

(*) fl(a op b) = (1+ eps)*(a op b) provided fl(a op b) neither overflows nor underflowswhere

(a op b) is the exact result of the operation, where op is one of +, -, * and / fl(a op b) is the floating point result | eps | <= machepsTo incorporate underflow, we may write

fl(a op b) = (1+ eps)*(a op b) + eta provided fl(a op b) does not overflowwhere

| eta | <= macheps * underflow_thresholdOn machines which do not implement gradual underflow (including some IEEE machines which have an option to perform flush-to-zero arithmetic instead), the corresponding error formula is

| eta | <= underflow_thresholdi.e. the error is 1/macheps times larger.

(**) fl(a +- b) = ( (1+eps1)*a ) +- ( (1+eps2)*b ) provided fl(a +- b ) neither overflows nor underflowsIn particular, this means that if a and b are nearly equal, so a-b is much smaller than either a or b (this is called

This happens because when y's significand is shifted right to line its binary point
up with 1's binary point, prior to actually subtracting the significands,
any bits shifted past the least significant bit of 1 are simply thrown away
(differently on a C90 and Cray 2), rather than participating in the subtraction.
On IEEE machines, there is a so-called *guard digit* (actually 3 of them) to store
any bits shifted this way, so they can participate in the subtraction.

This source of error has little effect on most problems in numerical linear algebra, but there are some important exceptions, as illustrated below.

error = alg(x) - f(x)Rather than trying to compute the error directly, we proceed as follows.

We first try to show that
alg(x) is *numerically stable*, that is alg(x) is nearly the exact result f(x+e) of
a slightly perturbed problem x+e:

alg(x) ~ f(x+e) where e is small

Second, assuming f(x) is a smooth function at x, we approximate f(x) by a first-order Taylor expansion near x:

f(x+e) ~ f(x) + e*f'(x)

Combining the last two expressions yields

error = alg(x) - f(x) ~ f(x+e) - f(x) ~ f(x) + e*f'(x) - f(x) = e*f'(x)Thus, we have expressed the error as a product of a small quantity e and the derivative f'(x). e is usually called the

|| error || <~ || e || * || f'(x) ||(The norm of a vector ||v|| or matrix ||M|| might simply be the root-sum-of-squares of its entries, for example.) Computing || f'(x) || exactly is often as expensive as solving the original problem. Therefore we often settle for an approximation; this is called

Roundoff error analysis is needed to show that e is small. We illustrate with a few examples.

If we evaluate log(1-x)/x in IEEE floating point arithmetic using this formula (except for the special case f(0)=-1), the computed graph is shown below. Note that the answer is quite wrong near x=0, with a spike going from -2 to 0:

Blowing up the graph near x=0 shows a sort of periodic behavior:

Blowing the graph up even more yields the following picture. To hint at what is going on, we have made green tic marks at all x values such that 1-x is an exact floating point number. The red tic marks are exactly half way in between the green tic marks, and are the thresholds where fl(1-x) suddenly changes from one floating point value to the next.

When x is tiny, log(1-x) is very close to -x. Said another way, when y is very close to 1, log(y) is very close to y-1. The next graph shows ((1-x)-1)/x, as computed in floating point. In exact arithmetic, this expression simplies to -1. But in floating point, it yields exactly the same picture as before:

The next graph shows both (1-x)-1 and -x (both computed in floating point) on the same graph. Their quotient is what was shown in the last graph. It is easy to see why the quotient is far from -1: When x is tiny, fl(1-x) can only take on a few discrete values near 1, so fl(fl(1-x)-1) can in turn only take on a few discrete values, whereas x more nearly takes on a continuum of values.

Here is a different algorithm for f(x) which computes it quite precisely in IEEE arithmetic:

y=1-x if y == 1, f = -1 else f = log(y)/(1-y) endifIn real arithmetic, the algorithm is clearly equivalent to f = log(1-x)/x (except at x=0, where it has the right limiting value). The graph of this function is shown below; it has no difficulty near x=0.

This algorithm is *backward stable*, because it nearly computes log(1-(x+e))/(x+e),
where e is small (on the order of of macheps); this makes the very reasonable assumption
that the logarithm function is implemented accurately (this is not difficult).
We invite the reader to supply a proof. (Hint: when x is small, there is no roundoff
error committed when computing fl(1-y).)

Unfortunately, this improved algorithm is not backward stable on the Cray C90, as illustrated in the graph below, which only shows the graph for x near 0:

The subsequent plot is a bigger blowup near x=0, with floating point numbers of the form 1-x marked as before.

The following plots show intermediate results in the computation, both on an IEEE machine and on the Cray, for contrast. The final result on an IEEE machine is the quotient of the two values plotted in the two rightmost graphs; their quotient is clearly -1. The final result on a C90 is the quotient of the two values plotted in the two leftmost graphs; their quotient is clearly not -1 for x>0.

The problem lies in the fact that fl(1-y) is off by a factor of up to 2 when y is near 1 on a C90, whereas fl(1-y) is exact in IEEE arithmetic. The cure on a C90 is to replace

y = 1-x by y = .5 + (.5 - x) and f = log(y)/(1-y) by f = log(y)/(.5+(.5-y))

The reader is invited to prove why this works, and to rejoice that the imminent universality of IEEE arithmetic will make such tricks unnecessary.

x = ( D - lambda*I )^(-1) * zUnfortunately, this formula is quite inaccurate when there are other eigenvalues very close to lambda. In particular, eigenvectors of different, but nearby, eigenvalues are not guaranteed to be orthogonal, which is essential for stability. This difficulty was attacked by Sorensen, Tang, Kahan and others, and proposed solutions were based on using double the input precision for certain critical parts of the calculation. When the input precision is already double, this would require quadruple, which is not widely available; this was the major stumbling block to widespread adoption of this algorithm for many years.

Finally, in 1992 Eisenstat and Gu nearly completely overcame this problem by using a different, stable formula, now incorporated in LAPACK routine sstedc (and slaed3, which sstedc calls). We refer the reader to the paper by Gu and Eisenstat cited below for details. Here we will just examine the kernel of their formula

(*) x(i) = prod_{j=1 to n except i} q(i,j) / ( lambda(i) - lambda(j) )where q(i,j) is some other (accurately) computed quantity. Stability depends on this formula being evaluated accurately.

With IEEE arithmetic we may apply error analysis formula (*) above. Assuming for simplicity that q(i,j) is exact, we see that the computed x(i) differs from the true x(i) by a factor f, where f is the product of 2n-1 different terms of the form 1+eps, |eps| <= macheps. Thus x(i) is computed to high relative accuracy as desired.

With Cray arithmetic, on the other hand, if some difference lambda(i) - lambda(j) undergoes extreme cancellation, it may not be computed accurately. If q(i,j) is also small, the factor q(i,j) / ( lambda(i) - lambda(j) ) may be O(1) and completely wrong, so that x(i) is completely wrong. This can indeed occur, and this temporarily prevented us from incorporating divide-and-conquer into LAPACK, until we figured out what to do.

The fix for this problem was proposed by W. Kahan, who pointed out that if no lambda(i) has a nonzero least significant bit, then when extreme cancellation occurs no information is lost by shifting. Thus, simply setting the bottommost bit of each lambda(i) to zero, a tiny error, would fix the problem. Doing this in a portable way, that would run whether compiled on a Cray, PC or any other machine, required the following trick: It turns out that computing

lambda = ( lambda + lambda ) - lambdahas no effect on lambda on any binary machine which rounds in any reasonable way (barring overflow). But on a Cray, it makes the bottom bit of lambda zero; the reader is invited to show why.

This still does not explain the ultimate appearance of the LAPACK code. It turns out that many aggressive optimizing compilers will examine the expression above, and simply throw it away, because in exact arithmetic, it obviously would have no effect. To try to prevent this, (lambda+lambda) is actually computed by a function call, where the function is compiled at a lower level of optimization. This is still no guarantee, but we hope that by the time compilers become aggressive enough for this to be a problem, non-IEEE arithmetics will have disappeared.

As a historical note, for a while we considered implementing this algorithm in
*doubled precision* which represents double precision quantities as
a pair (upper,lower) of single precision quantities, and manipulates these pairs.
The attraction is that if the input precision is already double, this simulates
quadruple, letting us use the same algorithm in all precisions. Algorithms for
this simulation are well known, but require a guard digit
during addition and subtraction to be reasonably efficient. So the Cray's lack of a
guard digit made this approach unattractive too.

z = arccos( x/sqrt(x^2+y^2) )An enigmatic run-time error was generated, which he (and several company engineers) finally tracked down to the fact that the computed value of x/sqrt(x^2+y^2) exceeded 1, to which arccos() subsequently objected. This occurred on the Cray because of sloppy rounding in the multiplication, division, and square root functions. Barring over/underflow, it can not occur using IEEE arithmetic. The user is invited to find a proof.

A main contribution of IEEE arithmetic was to standardize on how exceptions are handled.
There are five kinds of exceptions, as listed in the table below. The default response
for all 5 types is to set a *sticky status flag*, and continue computing the the
default result listed below. The status flag can be read by the user, and remains set
until reset by the user (hence "sticky").

Exception Name Cause Default Result -------------- ----- -------------- Overflow Result too large to represent Return +-Infinity as a normalized number Underflow Result too small to represent Return subnormal as a nonzero normalized number number or 0 Divide-by-zero Computing x/0, where x is Return +-infinity finite and nonzero Invalid infinity-infinity, 0*infinity, NAN infinity/infinity, 0/0, sqrt(-1), x rem 0, infinity rem y, comparison with NAN, impossible binary-decimal conversion Inexact A rounding error occurred rounded resultAlternatively, the IEEE standard recommends, but does not mandate, an alternative response to an exception, namely

This is the area where the most liberties are taken with implementations of IEEE arithmetic, since a complete implementation of all IEEE recommendations could impact programming languages, compilers, and operating systems. Also, interrupting precisely when an operation occurs is very difficult, when many operations are in partial state of execution in a pipelined execution unit; one either needs to have a complex enough system to back out of all the partially executed operations, or abandon many effective forms of pipelining.

- Use the fast algorithm to compute an answer; this will usually be accurate.
- Quickly and reliably assess the accuracy of the computed answer.
- In the unlikely event that the answer is not good enough, recompute it using the reliable algorithm.

We illustrate this paradigm with two examples below. It is effective on many machines, except that on machines which implement arithmetic with infinities and NANs very slowly, the worst case running time (as opposed to the average), could be quite slow. See the paper by Demmel and Li in the references for details.

|| computed x - true x || ------------------------- <= ||A|| * ||inv(A)|| * macheps || computed x ||The problem is to estimate || inv(A) || without computing inv(A) explicitly, which would usually be more expensive than solving Ax=b in the first place. LAPACK and other packages use an optimization approach which attempt to find a unit vector z, such that || inv(A)*z || is as large as possible; this typically requires a handful of solutions of linear systems of the form A*y=z, for carefully chosen z vectors. If we have already done Gaussian elimination and factored the matrix A into the product A=L*U of triangular matrices, the extra work involved in condition estimation is usually negligible. This is implemented in LAPACK routine sgecon.f.

Thus, the inner loop of the algorithm requires the solution of triangular linear systems (with matrices L and U), where we expect the solution to be large. This is where exceptions come in. When A is nearly singular, so the computed solution is quite large, overflows are most likely. But if A is nearly singular, this is precisely when the user is most likely to want to do condition estimation. Therefore, we cannot simply call the existing, manufacturer-supplied, highly-optimized triangular system solver in the BLAS, because it may overflow. Therefore, in LAPACK we call our own complicated, slow and reliable alternative slatrs.f, which is careful to scale inside the inner loop to avoid possible exceptions.

Applying our three-point paradigm to this situation yields the algorithm

- Do condition estimation using a highly-optimized but overflow-susceptible triangular system solver.
- Check the exception status flags; if no exception occurred, fine.
- Otherwise, recompute using slatrs, or simply signal that the matrix is highly ill-conditioned.

The result is an algorithm which is 1.5 to 5 times faster than the old one on a Sun 4/260, 1.6 to 9.3 times faster on a DEC Alpha, and 2.55 to 4.21 times faster on a Cray C90. A Cray C90 does not have IEEE exception handling, but these speedups were obtained in the most common case where no exceptions occurred. Thus, for the Cray, it measures the speed penalty paid on Cray by not having exception handling.

[ a(1) b(1) ] [ b(1) a(2) b(2) ] T = [ ... ... ... ] [ b(n-1) a(n-1) b(n) ] [ b(n) a(n) ]and s is a shift, then barring exceptions the following simple algorithm counts the number of eigenvalues of T less than s:

count = 0 b(0) = 0 d = 1 for i = 1 to n d = a(i) - s - b(i-1)^2/d if (d<0) count = count + 1 end forTo make this code robust, we must take into account exceptions in the inner loop. For example, if s were an eigenvalue of the leading k-by-k submatrix of T, then (in exact arithmetic) the k-th value of d would be exactly zero. Assuming we prescale T so its largest entry is neither too large nor too small (making ||T|| equal to the fourth root of the overflow threshold is a good value), then the following code is robust across all machines, and is currently in LAPACK routine slaebz.f (called by sstebz):

count = 0 b(0) = 0 d = 1 pivmin = sqrt(underflow_threshold) for i = 1 to n d = a(i) - s - b(i-1)^2/d if (abs(d) < pivmin) d = -pivmin ... guarantees division by |d| not too small if (d < 0) count = count + 1 end forHowever, this runs about 14% to 47% slower than the simpler algorithm above. With IEEE arithmetic, it turns out that the first algorithm is perfectly correct even when d becomes zero. The result is that the next d will be +-infinity, and on the subsequent iteration -b(i-1)^2/d will be zero again. One can show that this yields accurate and reliable results; see the reference below.