Notes for Ma221, Lecture 2, Jan 22, 2016

Goals:
  Floating point arithmetic
  Roundoff error analysis for polynomial evaluation
  Beyond basic error analysis: exceptions, high precision arithmetic,
    reproducibility, interval arithmetic, exploiting mathematical structure
    to get accuracy without high precision

Example: Polynomial evaluation, and polynomial zero finding

EX: Review how bisection finds a root of f(x)=0:
  start with an interval [x1,x2] where f changes sign: f(x1)*f(x2) < 0
  evaluate f at the midpoint: f((x1+x2)/2)
  keep bisecting the subinterval where f changes sign
Try it on (x-2)(x-3)(x-4) = x^3 - 9*x^2 + 26*x - 24:
  rts = [2,3,4]
  coeff = poly(rts)
  help bisect   (on web page)
  bisect(coeff,2.6,3.1,1e-12)
  bisect(coeff,2.2,3.2,1e-12)
  ... no surprises, we get small intervals around 3
Now try it on (x-2)^13:
  rts = 2*ones(1,13)
  coeff = poly(rts)
  ... no errors yet, take my word for it
  bisect(coeff,1.7,2.4,1e-12)
  bisect(coeff,1.7,2.2,1e-12)
  bisect(coeff,1.9,2.2,1e-12)
  ... huh? a very different answer each time?
Apply Horner's rule to (x-2)^13 (Matlab demo) - what is the real graph?
  x = 1.7:1e-4:2.3;
  y = polyval(coeff,x);
  yacc = (x-2).^13;
  plot(x,y,'k.',x,yacc,'r')
  axis([1.7,2.3,-1e-8,1e-8])
  ... can we explain this?
To summarize, we tried evaluating (x-2)^13 two ways:
  as (x-2)^13: a smooth, monotonic curve, as expected
  as x^13 - 26*x^12 + ... - 8192, with Horner's rule: for x in the range
    [1.8,2.2] we basically get random numbers in the range [-1e-8,1e-8]
Why the big difference? We need to understand floating point arithmetic.

Floating Point - How real numbers are represented in a computer

Long ago, computers did floating point in many different ways, making it hard
to understand bugs and write portable code. Fortunately Prof. Kahan led an
IEEE standards committee that convinced all the computer manufacturers to
agree on one way to do it, called the IEEE Floating Point Standard, for which
he won the Turing Award. This was in 1985; the standard was updated in 2008,
and is being updated again now. See his web page for much more information:
www.cs.berkeley.edu/~wkahan

Scientific notation: +- d.ddd x radix^e
Floating point usually uses radix = 2 (10 sometimes, 16 historically), so you
need to store a sign bit (+-), an exponent (e), and a mantissa (d.ddd).
Both p = #digits and the exponent range are limited, to fit into 32 bits,
64 bits, etc. For simplicity, we will briefly ignore the limit on the exponent
range, i.e. assume no overflow or underflow.
Normalization: We use 3.1000e0, not 0.0031e3 - i.e. the leading digit is
nonzero. Normalization gives uniqueness of representations, which is useful.
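As a quick illustration of sign/exponent/mantissa and normalization, here is a small
Matlab snippet (an added sketch, not part of the original demo; log2 and num2hex are
standard Matlab functions):
  x = -3.1
  [f,e] = log2(x)           % x = f * 2^e with 0.5 <= |f| < 1, so 2*|f| is the normalized mantissa 1.ddd...
  s = sign(x)               % the sign
  x == s*abs(2*f)*2^(e-1)   % reassemble: sign * mantissa-in-[1,2) * 2^exponent, exactly
  num2hex(x)                % the raw 64 bits in hex: 1 sign bit, 11 exponent bits, 52 fraction bits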
Def: rnd(x) = nearest floating point number to x
Relative Representation Error: RRE(x) = |x - rnd(x)| / |rnd(x)|
Maximum Relative Representation Error = max_x RRE(x), aka machine epsilon, macheps
  = half the distance from 1 to the next larger floating point number 1 + radix^(1-p)
  = .5 * radix^(1-p)   (attained near the midpoint: |(1 + .5*radix^(1-p)) - 1| / 1 = .5*radix^(1-p))
  = 2^(-p) in binary
Note: eps in Matlab = 2^(-52) = 2*macheps

Roundoff error model, assuming no over/underflow:
  fl(a op b) = rnd(a op b) = true result rounded to nearest
             = (a op b)(1 + delta), |delta| <= macheps
  where op may be add, subtract, multiply or divide.
We use this throughout the course; it is all you need for most algorithms.
It is also true for complex arithmetic (see Q 1.12 for details; use a bigger macheps).

IEEE single(S)/double(D)/quad(Q)/half(H) formats: radix = 2
  S: 32=1+8+23 bits, -126 <= e <= 127, p = 24, macheps = 2^(-24) ~ 6e-8,
     overflow threshold (OV) ~ 2^128 ~ 1e38, underflow threshold (UN) = 2^(-126) ~ 1e-38
  D: 64=1+11+52 bits, -1022 <= e <= 1023, p = 53, macheps = 2^(-53) ~ 1e-16,
     OV ~ 2^1024 ~ 1e308, UN = 2^(-1022) ~ 1e-308
  Q: 128=1+15+112 bits, -16382 <= e <= 16383, p = 113, macheps = 2^(-113) ~ 1e-34,
     OV ~ 2^16384 ~ 1e4932, UN = 2^(-16382) ~ 1e-4932
  H: 16=1+5+10 bits, -14 <= e <= 15, p = 11, macheps = 2^(-11) ~ 5e-4,
     OV ~ 2^16 ~ 7e4, UN = 2^(-14) ~ 6e-5
Note: S and D are available in hardware on most platforms. Q is usually done in
software (it was added to the 2008 standard). E(xtended), an 80-bit format on
Intel x86 architectures, was in the old IEEE standard from 1985, but is now
deprecated. Half precision (16 bits) is in the standard as a storage format,
but some computers are now adding arithmetic operations on it too, because they
can be much faster. Higher precision is available via software simulation (see
ARPREC, GMP on web page). The 2008 standard also added decimal arithmetic (for
Excel...).

That's enough to understand the plot of (x-2)^13, but more about floating point later.

Analyze Horner's rule for evaluating p(x):
  simplest expression: p(x) = sum_{i=0 to d} a_i x^i
  algorithm: p = a_d, for i=d-1:-1:0, p = x*p + a_i
  label the intermediate results (no roundoff yet):
    p_d = a_d, for i=d-1:-1:0, p_i = x*p_{i+1} + a_i
  introduce roundoff terms:
    p_d = a_d, for i=d-1:-1:0, p_i = [x*p_{i+1}*(1+d_i) + a_i]*(1+d'_i),
    where |d_i| <= macheps and |d'_i| <= macheps
  Unrolling the recurrence gives
    p_0 = sum_{i=0:d-1} [(1+d'_i) * prod_{j=0:i-1} (1+d_j)*(1+d'_j)] * a_i*x^i
          + [prod_{j=0:d-1} (1+d_j)*(1+d'_j)] * a_d*x^d
        = sum_{i=0:d-1} [product of 2i+1 terms like (1+d)] * a_i*x^i
          + [product of 2d terms like (1+d)] * a_d*x^d
        = sum_{i=0:d} a'_i * x^i
  In words: Horner is backward stable: you get the exact value at x of a
  polynomial whose coefficients a'_i are slightly changed from the input p(x).

How to simplify this to get an error bound:
  prod_{i=1:n} (1 + delta_i) <= prod_{i=1:n} (1 + macheps)
                              = 1 + n*macheps + O(macheps^2)   ... usually ignore O(macheps^2)
                             <= 1 + n*macheps/(1 - n*macheps) if n*macheps < 1   ... (lemma left to students)
  and similarly
  prod_{i=1:n} (1 + delta_i) >= prod_{i=1:n} (1 - macheps)
                              = 1 - n*macheps + O(macheps^2)   ... usually ignore O(macheps^2)
                             >= 1 - n*macheps/(1 - n*macheps) if n*macheps < 1   ... (lemma left to students)
  so |prod_{i=1:n} (1 + delta_i) - 1| <= n*macheps   ... ignoring macheps^2 terms
Thus
  |computed p_0 - p(x)| <= sum_{i=0:d-1} (2i+1)*macheps*|a_i*x^i| + 2*d*macheps*|a_d*x^d|
                        <= sum_{i=0:d} 2*d*macheps*|a_i*x^i|
and
  relerr = |computed p_0 - p(x)| / |p(x)|
        <= [sum_{i=0:d} |a_i x^i| / |p(x)|] * 2*d*macheps
         = condition number * relative backward error
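As a single-point sanity check of this bound (an added sketch; the test point x0 = 2.4
is an arbitrary choice, and recall eps = 2*macheps in Matlab, so 2*d*macheps = 13*eps here):
  coeff = poly(2*ones(1,13));     % coefficients of (x-2)^13, highest power first
  x0 = 2.4;
  y    = polyval(coeff,x0);       % Horner's rule on the expanded polynomial
  yacc = (x0-2)^13;               % accurate value: no cancellation in this form
  relerr = abs(y-yacc)/abs(yacc)                         % observed relative error
  relbnd = 13*eps*polyval(abs(coeff),abs(x0))/abs(yacc)  % 2*d*macheps * sum|a_i*x0^i| / |p(x0)|
  % relerr is (usually much) smaller than relbnd; -log10(relerr) and -log10(relbnd)
  % count the actual and guaranteed numbers of correct leading decimal digits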
How many decimal digits can we trust?
  d correct digits <=> relative error = 10^(-d) <=> -log_10(relative error) = d

How to modify Horner's rule to compute the error bound along with p(x):
  p = a_d, ebnd = |a_d|
  for i=d-1:-1:0, p = x*p + a_i, ebnd = |x|*ebnd + |a_i|, end
  ebnd = 2*d*macheps*ebnd
(A runnable Matlab version of this appears at the end of this discussion.)

Matlab demo:
  coeff = poly(2*ones(13,1));
  x = [1.6:1e-4:2.4];
  y = polyval(coeff,x);
  yacc = (x-2).^13;
  ebnd = 13*eps*polyval(abs(coeff),abs(x));   % note eps in Matlab = 2*macheps
  plot(x,y,'k.',x,yacc,'c',x,y-ebnd,'r',x,y+ebnd,'r')
  axis([1.65 2.35 -1e-8 1e-8]), grid
  % need to make the vertical axis wider to see the bounds
  axis([1.65 2.35 -1e-6 1e-6]), grid
  % conclusion: we can't trust the sign of the computed value inside roughly [1.72, 2.33]

Consider Question 1.21: how could we use this error bound to decide when to stop
iterating in root finding by bisection?

... now try a wider range, and look at the actual and estimated number of correct digits:
  x = -1:.0001:5;
  y = polyval(coeff,x);
  yacc = (x-2).^13;
  ebnd = 13*eps*polyval(abs(coeff),abs(x));
  plot(x,-log10(abs((y-yacc)./yacc)),'k.',x,-log10(ebnd./abs(yacc)),'r')
  axis([-1 5 0 16]), grid
  title('Number of correct digits in y')

This picture is a foreshadowing of what will happen in linear algebra. The
vertical axis is the number of correct digits, both actual (the black dots) and
lower-bounded using our error bound (the red curve). The horizontal axis is the
problem we are trying to solve, in this simple case the value of x at which we
are evaluating a fixed polynomial p(x). The number of correct digits gets
smaller and smaller, until no leading correct digits are computed, the closer
the problem gets to the hardest possible problem, in this case the root x=2 of
the polynomial. This is the hardest problem because the only way to get a small
relative error in the solution p(2)=0 is to compute 0 exactly, i.e. no roundoff
is permitted. And changing x very slightly makes the answer p(x) change a lot,
relatively speaking. In other words, the condition number,
sum_i |a_i x^i| / |p(x)| in this simple case, approaches infinity as p(x)
approaches zero.

In linear algebra the horizontal axis still represents the problem being
solved, but since the problem is typically defined by an n-by-n matrix, we need
n^2 axes to define the problem. There are now many "hardest possible problems",
for example singular matrices if the problem is matrix inversion. The singular
matrices form a set of dimension n^2-1 in the set of all matrices, the surface
defined by det(A)=0. And the closer the matrix is to this set, i.e. to being
singular, the harder matrix inversion will be, in the sense that the error
bound will get worse and worse the closer the matrix is to this set. Later we
will show how to measure the distance from a matrix to the nearest singular
matrix "exactly" (i.e. except for roundoff) using the SVD, and show that the
condition number, and so the error bound, is inversely proportional to this
distance.

Here is another way the above figure foreshadows linear algebra. Recall that we
could interpret the computed value of the polynomial p(x) = sum_i a_i*x^i, with
roundoff errors, as the exactly right value of a slightly wrong polynomial,
that is p_alg(x) = sum_{i=0:d} [(1+e_i)*a_i]*x^i, where |e_i| <= 2*d*macheps.
We called this property "backward stability", in contrast to "forward
stability", which would mean that the answer itself is close to correct. So the
error bound bounds the difference between the exact solutions of two slightly
different problems, p(x) and p_alg(x). For most linear algebra algorithms, we
will also show that they are backward stable. For example, if we want to solve
A*x=b, we will instead get the exact solution of (A+Delta)*xhat = b, where the
matrix Delta is "small" compared to A. Then we will get error bounds by
essentially taking the first term of a Taylor expansion of inv(A+Delta):
  xhat - x = inv(A+Delta)*b - inv(A)*b
To do this, we will need to introduce some tools like matrix and vector norms
(so we can quantify what "small" means) in the next lecture.
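Before the final comments on computer arithmetic, here is a self-contained Matlab
version of the modified Horner's rule above, returning both p(x) and its error bound
(an added sketch; the function name hornerbnd is made up, save it as hornerbnd.m):
  function [p,ebnd] = hornerbnd(coeff,x)
  % Evaluate the polynomial with coefficients coeff (highest power first, as
  % returned by poly) at the scalar x by Horner's rule, and also return the
  % running error bound ebnd = 2*d*macheps*sum_i |a_i*x^i|.
  d = length(coeff) - 1;          % degree of the polynomial
  p = coeff(1); ebnd = abs(coeff(1));
  for i = 2:d+1
    p = x*p + coeff(i);
    ebnd = abs(x)*ebnd + abs(coeff(i));
  end
  ebnd = 2*d*(eps/2)*ebnd;        % macheps = eps/2 in Matlab (double precision)
For example, [p,ebnd] = hornerbnd(poly(2*ones(1,13)),2.2) returns a value p whose
error bound ebnd exceeds |p| itself, so even the sign of p cannot be trusted there,
consistent with the conclusion of the demo above.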
Now we finish with some final comments on computer arithmetic and roundoff error.

First, notice the analogy between Horner's rule and computing dot products:
  p = a_d, for i=d-1:-1:0, p = x*p + a_i
  s = 0,   for i=1:d,      s = x_i*y_i + s
Thus the error analysis of dot products is very similar (homework 1.10).

Next we briefly discuss some properties and uses of floating point that go
beyond these most basic properties (more details are given at the end of these
notes).

(1) Exceptions: IEEE arithmetic has rules for cases like:
    Underflow: Tiny / Big = 0 (or "subnormal", special numbers at the bottom of
      the exponent range)
    Overflow and Divide-by-Zero: 1/0 = Inf = "infinity", represented as a
      special case in the usual format; Big + Big = Inf, Big*Big = Inf,
      3 + Inf = Inf, etc.
    Invalid Operation: 0/0 = NaN = "Not a Number", also represented as a
      special case; Inf - Inf = NaN, sqrt(-1) = NaN, 3 + NaN = NaN, etc.
    Flags are available to check whether such exceptions occurred.
    Impact on software:
      Reliability, if handled correctly (see the Ariane 5 crash on the webpage).
        We are currently investigating how to guarantee that LAPACK cannot
        "misbehave" because of exceptions (go into infinite loops, give wrong
        answers without warning, etc.).
      Speed: run a "reckless" algorithm that is fast but ignores possible
        exceptions, check the flags to see if an exception occurred, and in the
        rare case of an exception, rerun with a slower, more reliable algorithm.

(2) Higher precision arithmetic: it is possible to simulate this using either
fixed or floating point; various packages are available: MPFR, ARPREC, XBLAS
(see web page). There are also special fast tricks just for high precision
summation; see Q1.18.

(3) Reproducibility: Almost all users expect that running the same program more
than once gives the bitwise identical answer; this is important for debugging,
correctness, and sometimes even legal reasons. But it can no longer be
expected, even on your multicore laptop, because parallel computers (which is
nearly all of them now) may execute sums in different orders, and roundoff
makes summation nonassociative:
  (1-1)+1e-20 = 1e-20  neq  0 = (1+1e-20)-1
There is lots of work on algorithms that fix this without slowing down too
much, such as ReproBLAS (see web page). There is even possible interest in
adding support for reproducibility to the IEEE 754 standard.

(4) Guaranteed error bounds from interval arithmetic: Represent each floating
point number by an interval [x_low,x_high], and use the special rounding modes
in the IEEE standard to make sure that lower and upper bounds are maintained
throughout the computation:
  [x_low,x_high] + [y_low,y_high] = [round_down(x_low+y_low), round_up(x_high+y_high)]
Drawback: naively replacing all variables by intervals like this often makes
the interval widths grow so fast that they are useless. There have been many
years of research on special algorithms that avoid this, especially for linear
algebra (see web page).

(5) Exploiting structure to get high accuracy: Some matrices have special
mathematical structure that allows formulas to be used where the roundoff is
provably bounded, so that you get high relative accuracy, i.e. most leading
digits correct. For example, a Vandermonde matrix has entries V(i,j) = x_i^(j-1)
and arises naturally from polynomial interpolation problems. It turns out that
this special structure permits many linear algebra problems, even eigenvalues,
to be solved accurately, no matter how hard ("ill-conditioned") the problem
would be for conventional algorithms. See the webpage for details.
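Before going into more detail, here is a quick Matlab illustration of the exception
values in item (1) and the nonassociativity in item (3) (an added sketch with made-up
values):
  1/0                 % Inf
  -1/0                % -Inf
  0/0                 % NaN
  Inf - Inf           % NaN
  3 + NaN             % NaN: NaNs propagate, so one shows up in the output if it occurred
  realmin/2           % a subnormal number: gradual underflow rather than flushing to 0
  (1 - 1) + 1e-20     % = 1e-20, but ...
  (1 + 1e-20) - 1     % = 0: roundoff makes floating point addition nonassociative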
Here are more details about the above floating point issues:

(1) Exceptions: What happens when the computed result would have an exponent
that is too big to fit, too small to fit, or isn't mathematically defined, like
0/0? We get "exceptions": rules about the result and how to compute with it,
plus flags to check whether an exception occurred.

What if the answer is > OV? We get infinity (overflow):
  fl(1e20*1e20) = fl(1/0) = inf   ... in single
  fl(3+inf) = inf, fl(1/inf) = 0, fl(1/-0) = -inf
What if the answer is mathematically undefined? We get NaN = Not a Number:
  fl(sqrt(-1)) = fl(0/0) = fl(inf-inf) = NaN
  3 + NaN = NaN   ... so you see a NaN on output if one occurred
What if the answer is < UN? We get underflow. What should the result be?
If you return zero, then consider what happens with the code
  if (a .ne. b) then x = 1/(a-b)
... it can divide by zero. Instead, the IEEE standard says you get "subnormal
numbers" x = +- 0.dddd * 2^exp_min instead of +- 1.dddd * 2^exp.
[Draw IEEE number line]
Without subnormals, we do error analysis with
  fl(a op b) = (a op b)(1 + delta) + eta, |delta| <= macheps, |eta| <= UN
With subnormals, we can do error analysis with
  fl(a op b) = (a op b)(1 + delta) + eta, |delta| <= macheps,
  |eta| <= UN*macheps for *,/ and eta = 0 for +,-
Thm: With subnormals, for all floats a,b: fl(a-b) = 0 iff a = b.
Purpose: simplify reasoning about floating point algorithms.

Why bother with exceptions? Why not always just stop when one occurs?
(1) Reliability: it is too hard to put a test before each floating point
    operation to avoid an exception.
    Ex: control systems (see the Ariane 5 crash on the webpage).
    Ex: Matlab: we don't want to go into an infinite loop because of an input
    NaN (this caused several fixes to LAPACK, and also helped motivate an
    ongoing CS research project to build tools to prove that NaNs, infinities,
    etc. cannot cause infinite loops or analogous bugs).
(2) Speed: ditto (too slow to test before each operation). Instead:
    (1) run "reckless" code that is fast but assumes no exceptions,
    (2) check the exception flags,
    (3) in case of an exception, rerun with a slower algorithm.
    Ex: s = root-sum-squares = sqrt(sum_{i=1}^n x_i^2).
    What could go wrong? (see Q 1.19)
(3) Sometimes one can prove code correct with exceptions:
    Ex: the current fastest algorithms to find (some) eigenvalues of symmetric
    matrices depend on these rules, including 1/(-0) = -inf.

Impact on LAPACK: One of the fastest algorithms in LAPACK for finding
eigenvalues of symmetric matrices assumes that 1/+0 = +infinity and
1/-0 = -infinity, as specified by the IEEE standard. We were porting this code
to an ATI GPU a couple of years ago, and discovered that it did not handle
exceptions correctly: 1/(-0) was +infinity instead of -infinity. The code
depended on getting -infinity to get the right answer, and (until they fixed
their hardware) we had to take the quantity 1/d in the denominator and instead
compute 1/(d+0), which turns the -0 into a +0, whose reciprocal is computed
correctly. See www.cs.berkeley.edu/~volkov for details.
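A small Matlab illustration of the signed-zero behavior this story depends on (an added
sketch; the results assume IEEE-compliant arithmetic, which Matlab on ordinary hardware
provides):
  nz = -1*0;          % negative zero: same value as 0, but with the sign bit set
  nz == 0             % true: -0 and +0 compare equal
  1/nz                % -Inf, as the IEEE standard requires
  1/(nz + 0)          % +Inf: adding +0 turns -0 into +0, the workaround described above
  d = 1e-3;
  1/(d + 0) == 1/d    % true: for nonzero d the workaround changes nothing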
Impact on Matlab: In 2009 we released a new version of an LAPACK routine, used
in least squares solvers and eigenvalue routines, that could be a lot faster on
band matrices and could be used to more easily generate random orthogonal
matrices with a certain probability distribution requested by some users.
Mathworks noticed that it was more susceptible to over/underflow, and after
careful analysis we decided this was an intrinsic property, and we had to
remove it from most parts of later releases.

(2) Exploiting lack of roundoff in special cases to get high precision:

Fact: if 1/2 < x/y < 2 then fl(x-y) = x-y, i.e. no roundoff.
Proof: the cancellation in x-y means the exact result fits in p bits.

Fact: Suppose M >= abs(x), and we compute S = fl(M+x), q = fl(S-M), r = fl(x-q).
In exact arithmetic we would get r = x-q = x-(S-M) = x-((M+x)-M) = 0. But in
floating point, S+r = M+x exactly, with r being the roundoff error, i.e. (S,r)
is the double precision sum of M+x. (Proof sketch in a special case: the only
roundoff occurs in S = fl(M+x); the other two operations are exact.)

This trick, called "Accurate 2-Sum", can be generalized in a number of ways to
get very general algorithms, available in the following software packages:
  ARPREC (see class website): provides arbitrary precision arithmetic.
  XBLAS (see class website): provides some double-double precision Basic Linear
    Algebra Subroutines (BLAS), like matrix-vector multiplication. LAPACK uses
    these routines to provide high accuracy solvers for Ax=b.
  ReproBLAS (see class website): provides bit-wise reproducible parallel
    implementations of some BLAS. The challenge here is that since roundoff
    makes floating point addition nonassociative, computing a sum in a
    different order will usually give a (slightly) different answer. On a
    parallel machine, where the number of processors (and other resources) may
    be scheduled dynamically, computing a sum in a different order is likely.
    Since getting even slightly different results is a challenge for debugging
    and (in some cases) correctness, we and others are working on efficient
    algorithms that guarantee reproducibility.
Finally, we point out the paper "Accurate and Efficient Floating Point
Summation," J. Demmel and Y. Hida, SIAM J. Sci. Comp., 2003, for an efficient
way to get full accuracy in a summation despite roundoff (in brief: to sum n
numbers to full accuracy, you need to use log_2(n) extra bits of precision, and
sum them in decreasing order of magnitude), and the website
www.ti3.tu-harburg.de/rump/ for a variety of papers on linear algebra with
guaranteed accuracy.
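Returning to the Accurate 2-Sum fact above, here is a quick Matlab check (an added
sketch; M and x are arbitrary example values with M >= |x|):
  M = 1; x = 3e-17;         % x is below macheps, so fl(M+x) loses it entirely
  S = M + x;                % S = fl(M+x) = 1
  q = S - M;                % q = 0, computed exactly
  r = x - q;                % r = 3e-17, computed exactly: r recovers the roundoff error in S
  S == 1, r == 3e-17        % both true: the pair (S,r) represents M+x = 1 + 3e-17 exactly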
(3) Guaranteed error bounds (sometimes) via interval arithmetic:

So far we have used rnd(x) to mean the nearest floating point number to x.
(Note: ties are broken by choosing the floating point number whose last bit is
zero, also called "round to nearest even". This has the attractive property
that half the ties are rounded up and half rounded down, so there is no bias in
long sums.) But the IEEE Floating Point Standard also requires the operations
  rnd_down(x) = largest floating point number <= x
  rnd_up(x)   = smallest floating point number >= x
Thus [rnd_down(x+y), rnd_up(x+y)] is an interval guaranteed to contain x+y. And
if [x_min,x_max] and [y_min,y_max] are two intervals guaranteed to contain x
and y, respectively, then [rnd_down(x_min+y_min), rnd_up(x_max+y_max)] is
guaranteed to contain x+y.

So if each floating point number x in a program is represented by an interval
[x_min,x_max], and we use rnd_down and rnd_up to get lower and upper bounds on
the result of each floating point operation (note: multiplication and division
are a little trickier than addition), then we can get guaranteed error bounds
on the overall computation; this is called interval arithmetic. Alas, naively
converting all variables and operations in a program to intervals often results
in intervals whose widths grow so rapidly that they provide little useful
information. So there has been much research over time on designing new
algorithms that try to compute intervals that are narrow at reasonable cost.
Google "interval arithmetic" or see the website www.ti3.tu-harburg.de/rump/ for
more information.

(4) Accurate Linear Algebra despite roundoff:

A natural question is which algebraic formulas have the property that there is
some way to evaluate them so that, despite roundoff, the final answer is always
correct in most of its leading digits:
  Ex: a general polynomial: no, not without higher precision.
  Ex: x^2+y^2: yes, there is no cancellation.
  Ex: the determinant of a Vandermonde matrix, V(i,j) = x_i^(j-1): the general
      algorithm via Gaussian elimination can lose all digits, but the formula
      det(V) = prod_{i < j} (x_j - x_i) works.
A complete answer is an open question, but there are some necessary and
sufficient conditions based on algebraic and geometric properties of the
formula; see the article by Demmel/Dumitriu/Holtz/Koev, "Accurate and Efficient
Algorithms in Linear Algebra", Acta Numerica v. 17, 2008, where a class of
linear algebra problems is identified that, like det(Vandermonde), permit
accurate solution despite roundoff. We will not discuss this further, except to
say that the mathematics depends on results going back to Hilbert's 17th
problem, which asked whether positive rational (or polynomial) functions can
always be written as sums of squares of other rational (or polynomial)
functions (answers: rational = yes, polynomial = no). For example, when
0 < x_1 < x_2 < ... in the Vandermonde matrix V, it turns out that almost any
linear algebra operation on V can be done efficiently and to nearly full
accuracy despite roundoff, including eig(V). (This work was cited in coauthor
Prof. Olga Holtz's award of the 2008 European Mathematical Society Prize.)
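As a small Matlab experiment illustrating the Vandermonde example (an added sketch; the
nodes x and size n are arbitrary choices, and the product formula is used here as the
reference value, since det computed via Gaussian elimination may lose many or all digits
for such closely spaced nodes):
  n = 10;
  x = 1 + (1:n)*1e-3;                       % closely spaced nodes => very ill-conditioned V
  V = zeros(n);
  for j = 1:n, V(:,j) = x(:).^(j-1); end    % V(i,j) = x_i^(j-1)
  d_lu = det(V)                             % determinant via Gaussian elimination (LU)
  d_formula = 1;
  for i = 1:n
    for j = i+1:n
      d_formula = d_formula*(x(j)-x(i));    % det(V) = prod_{i<j} (x_j - x_i)
    end
  end
  d_formula
  abs(d_lu - d_formula)/abs(d_formula)      % relative discrepancy between the two computed values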