Math 128a - Spring 2002 - Lecture 2 - Jan 24 (Thursday)

Review of 64-bit floating point numbers from last time: store
   s = sign (1 bit)
   e = exponent (11 bits), 0 <= e <= 2047
   f = fraction (52 bits), 0 <= f < 1
usual case: 0 < e < 2047
   x = (-1)^s * 2^(e-1023) * (1+f)
   (range from about 10^(-308) to 10^(308))
e = 2047, f = 0  means x = +-infinity
e = 2047, f != 0 means x = NaN = Not a Number
e = 0,    f = 0  means x = +-0
e = 0,    f != 0 means x is "subnormal":
   x = (-1)^s * 2^(-1022) * (0 + f)
(illustrate with Matlab's "format hex")

What operations are available?
   Basic four: +, -, *, /
   Also sqrt(x), remainder(x,y), compare(x,y), binary-decimal conversion

Rounding: what do you get from 1/3, which is not an exact floating point
number (why not?)

Def: if op is one of +, -, *, /, then the floating point result of a op b
is written fl(a op b), and is equal to the floating point number nearest
the true value of a op b. (In case of a tie, choose the floating point
number whose bottom bit is 0.) This is known as "round to nearest", and is
as accurate as possible.

Exceptions, when "round to nearest" is trickier to interpret:
   Overflow       - 1e200^2 = 1e400, too big, get +Inf
   Underflow      - 1e-308^2, get zero
                    1e-308 / 100, get a subnormal number
   Divide-by-zero - 1/+-0, get +-Inf
   Invalid        - inf-inf, 0/0, sqrt(-1), inf/inf, get NaN
For each exception one always gets a "sticky bit", a flag that says an
exception has occurred since the bit was last reset. One may continue
computing (the default) or trap (not available in Matlab, which has just
isnan(), isinf(), etc.).

Go back and understand the infinite loop from last time:

Example 2: run bisect to find the root of cos(x)=0 in [1,2]
   [xout,fn,iters]=bisect('cos',[1,2],1e-15,0)

Example 3: run bisect to find the root of cos(x)=0 in [1,2]
   [xout,fn,iters]=bisect('cos',[1,2],1e-16,0)
   [xout,fn,iters]=bisect('cos',[1,2],1e-16,1)

Explanation: if xlower and xupper are adjacent floating point numbers,
then xmid = (xlower+xupper)/2 rounds to xlower or xupper, and no progress
is made.

Note: consider 3-digit decimal arithmetic: fl((.999+.997)/2) = 1.00, so
(xlower+xupper)/2 does not even have to be in the interval
[xlower,xupper]! This can't happen with binary numbers (homework).

Possible fixes for the bisection algorithm:
   stop if xmid = xlower or xmid = xupper
   stop if the number of iterations exceeds some upper bound
Both are reasonable, but neither does what most people want, which is
"find me the answer to 5 digits (or 10 digits) as fast as you can".
How do we do this? What does this mean exactly?

   pi is roughly 3.14159265358979323...
   pi is 3.1416       to  5 digits
   pi is 3.141592654  to 10 digits

Def: Let x be an approximation to X. Then
   the "absolute error" in x is |x - X|
   the "relative error" in x is |x - X| / |X|

Ex: |3.1416 - pi| / |pi|      ~ 10^(-5)  (a little less)
    |3.141592654 - pi| / |pi| ~ 10^(-10) (a little more)

Fact: small relative error corresponds to the idea of knowing the leading
digits of the answer (details on homework).

Returning to bisection: if the user wants, say, 5 digits, then we want to
stop as soon as the relative error is < 10^(-5). How do we change the
stopping test in the code to do this? That is, how do we change
   while (width > tol)
      do a bisection step
   end
to
   while (relative error > reltol)
      do a bisection step
   end
Why is this a problem? (For reference, a sketch of the basic bisect
routine used in the examples appears below.)
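Not shown in the original notes: a minimal sketch of what the bisect
routine called above might look like. The signature is inferred from the
calls [xout,fn,iters]=bisect(fname,interval,tol,verbose); the body is an
assumption and may differ from the bisect.m used in lecture.

   function [xout, fn, iters] = bisect(fname, interval, tol, verbose)
   % Bisection with the absolute stopping test: iterate until the
   % interval width is at most tol.  (Sketch only.)
   xlower = interval(1);  xupper = interval(2);
   flower = feval(fname, xlower);
   iters = 0;
   while (xupper - xlower) > tol
      xmid = (xlower + xupper)/2;
      fmid = feval(fname, xmid);
      if verbose, disp([iters, xlower, xupper, fmid]), end
      if fmid*flower <= 0        % sign change in [xlower, xmid]
         xupper = xmid;
      else                       % sign change in [xmid, xupper]
         xlower = xmid;  flower = fmid;
      end
      iters = iters + 1;
   end
   xout = (xlower + xupper)/2;
   fn = feval(fname, xout);

Note that if tol is smaller than the spacing between adjacent floating
point numbers near the root (as in Example 3), the while test never
fails and the loop runs forever, which is exactly the behavior seen in
lecture.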
The problem: we can't compute the exact relative error, since we don't
know the exact answer. Instead, we will use the approximation
   relative error ~ width of interval / largest number in interval
(on the homework you will show that this is a good approximation).

New stopping test for bisection:
   while ( width / largest number in interval > reltol )
      do a bisection step
   end

Example 4:
   [xout,fx,iters]=bisect    ('cos',[100000,100004],1e-10,0)
   [xout,fx,iters]=bisect_rel('cos',[100000,100004],1e-10,0)
   The second one converges faster!

Example 5:
   [xout,fx,iters]=bisect    ('cos',[1000000,1000004],1e-10,0)
   [xout,fx,iters]=bisect_rel('cos',[1000000,1000004],1e-10,0)
   The second one converges!

Example 6:
   [xout,fx,iters]=bisection_rel('cube',[-1,2],1e-5,0)
   yields an interval not containing 0, with negative function values at
   both endpoints - why?

Proposed fix for the bisection algorithm: replace the test
      fmid*flower <= 0
   by
      sign(fmid)*sign(flower) <= 0
   Why does this fix the problem of ending up with an interval whose two
   endpoints both have negative function values?

Example 7:
   [xout,fx,iters]=bisection_rel('identity',[-1,2],1e-5,0)
   Now it goes into an infinite loop - why?
   Fix by having an absolute lower bound on the width:
      while ( width > max( reltol * largest number in interval, abstol ) )
         do a bisection step
      end

Example 8: f(x) = (x-2)^13, in [1.1,3.2], tol = 1e-15
   [xout,fn,iters]=bisect('func',[1.1, 3.2], 1e-15, 0)

Example 9: f(x) = (x-2)^13 evaluated using the expanded form
           (polyval in Matlab), tol = 1e-10
   [xout,fn,iters]=bisect('func',[1+rand(1), 3.2], 1e-10, 0)
   Observe that the answer changes a lot depending on rand(1).

Example 10: f(x) = (x-2)^13 evaluated using polyval
   [xout,fn,iters]=bisect_anim('func',[rand(1), 3.2], 1e-15)

Example 11:
   x=(1:.001:3); y=func(x); plot(x,y), axis([1 3 -1e-8 1e-8])
   Roundoff makes the answer "junk", but junk with structure:
   x=2+(0:2000)*2*eps; y=func(x); plot(y,'.')

Goals:
   Want to understand how to bound the error in evaluating a polynomial.
   Want to modify bisection (or any zero finder) to use the error bound
   to give the user a bound on the location of the zero.
   Relative error is the natural way to bound error in floating point.

UN = smallest normalized floating point number = "underflow threshold"
   = 2^(-1022) ~ 10^(-308)
OV = largest normalized floating point number = "overflow threshold"
   ~ 2^(1024) ~ 10^(308)

Prop: Suppose x is a real number, and x* is its closest possible floating
point approximation (e.g. x = a+b and x* = fl(a+b)). Then provided
UN <= |x| <= OV, we have
   |x - x*| / |x| <= 2^(-53) ~ 1e-16

Notation: 2^(-53) is called "epsilon" or "unit roundoff" or "machine
epsilon" or "macheps" for short. (Note: in Matlab, eps = 2^(-52) is a
built-in constant, defined as twice our epsilon!)

Proof: Assume w.l.o.g. that x is positive. Write x* = 2^e * (1+f) where
0 <= f < 1. First suppose 0 < f < 1. The two floating point numbers
closest to x* are x* - 2^e * 2^(-52) and x* + 2^e * 2^(-52). Then the
fact that x* is the floating point number closest to x means that
   x* - 2^e * 2^(-53) <= x <= x* + 2^e * 2^(-53)
or
   -2^e * 2^(-53) <= x - x* <= 2^e * 2^(-53)
or, dividing by x > 0,
   -2^e * 2^(-53) / x <= (x - x*) / x <= 2^e * 2^(-53) / x
so
   |x - x*| / |x| <= 2^e * 2^(-53) / |x|
                  <= 2^e * 2^(-53) / 2^e     since |x| >= 2^e * (1+2^(-53)) >= 2^e
                  =  2^(-53)
The case f = 0 is left as an exercise to the student.

Corollary: Let op be one of +, -, *, /. Suppose UN <= |a op b| <= OV. Then
   |fl(a op b) - (a op b)| / |a op b| <= 2^(-53)
or
(*)   fl(a op b) = (a op b)*(1 + delta)   where |delta| <= epsilon

Proof: Define delta = [fl(a op b) - (a op b)] / (a op b), which satisfies
|delta| <= epsilon by the Prop. Then solve for fl(a op b).
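A quick Matlab illustration of these constants and of round-to-nearest
(a sketch; the comments assume the IEEE double format described above):

   realmin               % smallest normalized double, 2^(-1022) ~ 2.2e-308 (= UN)
   realmax               % largest double, just below 2^1024, ~ 1.8e308    (~ OV)
   eps                   % Matlab's eps = 2^(-52), twice the epsilon = 2^(-53) used here
   % Round to nearest, ties to the number with bottom bit 0:
   % 1 + 2^(-53) lies exactly halfway between 1 and the next float 1 + 2^(-52),
   % so it rounds back down to 1.
   (1 + 2^(-53)) == 1    % returns 1 (true)
   (1 + 2^(-52)) == 1    % returns 0 (false)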
We note that IEEE also guarantees that
(**)  fl(sqrt(a)) = sqrt(a)*(1 + delta),  |delta| <= epsilon
(*) and (**) are the simple models from which we can derive error bounds.
We will do some simple examples, and then polynomial evaluation.

Example 12: The program
   p = 1; for i=1:n, p = p*x(i); end
computes the product x(1)*x(2)*...*x(n). What is the difference between
the true p and the computed p?

Approach: write down a "program" that includes all the rounding error
explicitly, and evaluate it exactly:
   p* = 1; for i=1:n, p* = (p* * x(i))*(1 + delta(i)); end
where delta(i) is the roundoff error in the i-th multiplication.
Evaluating, we get
   p* = x(1)*...*x(n) * (1+delta(1))*...*(1+delta(n))
      = p * (1+delta(1))*...*(1+delta(n))
      = p * (1+r)    where 1+r = (1+delta(1))*...*(1+delta(n))
Ideally, 1+r is as close to 1 as possible. How close is it?
Simple bounds:
   (1-epsilon)^n <= 1+r = (1+delta(1))*...*(1+delta(n)) <= (1+epsilon)^n
so
   (1-epsilon)^n - 1 <= r <= (1+epsilon)^n - 1
How do we approximate (1 +- epsilon)^n? Simplest way: binomial theorem or
Taylor expansion:
   (1 +- epsilon)^n = 1 +- n*epsilon + O((n*epsilon)^2)
so if n*epsilon << 1, we get
   -n*epsilon + O((n*epsilon)^2) <= r <= n*epsilon + O((n*epsilon)^2)
For example, n <= 100 and epsilon ~ 1e-16 mean that p* approximates p to
at least 14 decimal places. The approximation is ok if n*epsilon << 1
(see below). To be rigorous, and not ignore the O(epsilon^2) term, one
can prove
   Lemma: |r| <= n*epsilon / (1 - n*epsilon)   if n*epsilon < 1
   Proof: homework
This is a "worst case analysis"; the worst case only happens if all
delta(i) ~ epsilon, or all delta(i) ~ -epsilon. This is unlikely, but
possible, so often the answer is more accurate than the worst case
prediction (examples later). From now on, we will write |r| <= n*epsilon,
ignoring second order terms.

Example 13: The program
   s = 0; for i=1:n, s = s + x(i); end
computes the sum x(1)+x(2)+...+x(n). What is the difference between the
true s and the computed s*?

Approach: as before, write down a program that includes all the rounding
error explicitly, and evaluate it exactly:
   s* = 0; for i=1:n, s* = (s* + x(i))*(1 + delta(i)); end
where delta(i) is the roundoff error in the i-th addition. This is
trickier algebra, so the idea is to modify the program to have a unique
identifier for each intermediate value:
   s*(0) = 0; for i=1:n, s*(i) = (s*(i-1) + x(i))*(1 + delta(i)); end
Multiply out a few terms to see the pattern (we skip the induction proof):
   s*(1) = x(1)    ... no roundoff when adding to 0
   s*(2) = (x(1)+x(2))*(1+delta(2))
   s*(3) = ((x(1)+x(2))*(1+delta(2)) + x(3))*(1+delta(3))
         =   x(1)*(1+delta(2))*(1+delta(3))
           + x(2)*(1+delta(2))*(1+delta(3))
           + x(3)*(1+delta(3))
   s*(4) =   x(1)*(1+delta(2))*(1+delta(3))*(1+delta(4))
           + x(2)*(1+delta(2))*(1+delta(3))*(1+delta(4))
           + x(3)*(1+delta(3))*(1+delta(4))
           + x(4)*(1+delta(4))
   ...
   s*(k) =   x(1)*(1+delta(2))*...*(1+delta(k))
           + x(2)*(1+delta(2))*...*(1+delta(k))
           + x(3)*(1+delta(3))*...*(1+delta(k))
           + x(4)*(1+delta(4))*...*(1+delta(k))
           + ...
           + x(k)*(1+delta(k))
so
   s*(n) = sum from i=1 to n of x*(i)
where
   x*(i) = x(i)*(1+r(i))   with |r(i)| <= (n-1)*epsilon
In words, s*(n) is the exact sum of a perturbed set of summands x*(i),
each of which is a small relative perturbation of x(i).

What does this mean? First, suppose all the x(i) are positive; then we
claim that
   s*(n) = s(n)*(1+delta),  where |delta| <= (n-1)*epsilon
Proof:
   s*(n) - s(n) = sum x(i)*(1+r(i)) - sum x(i) = sum x(i)*r(i)
               <= sum x(i)*(n-1)*epsilon = s(n)*(n-1)*epsilon
so s*(n) <= s(n)*(1+(n-1)*epsilon), and similarly
s*(n) >= s(n)*(1-(n-1)*epsilon).

What if not all the x(i) are positive? (A small numerical illustration of
the summation bound appears below.)
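A small Matlab experiment (a sketch with made-up data, not from the
lecture) suggesting how the summation bound behaves: summing the same
numbers in two different orders gives two differently rounded results.
Each result is within about (n-1)*epsilon*sum(|x(i)|) of the exact sum,
so their difference is at most twice that bound, and typically far less.

   n = 1e6;
   x = randn(n,1);                                  % hypothetical test data
   s_forward = 0;   for i = 1:n,     s_forward  = s_forward  + x(i); end
   s_backward = 0;  for i = n:-1:1,  s_backward = s_backward + x(i); end
   difference = abs(s_forward - s_backward)
   bound = 2*(n-1)*2^(-53)*sum(abs(x))              % twice the worst-case bound above
   % Typically difference << bound: the worst case is rarely attained.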
If the x(i) can have either sign, then we can only show that
   | s*(n) - s(n) | = | sum_i x(i)*r(i) | <= (n-1)*epsilon * sum_i |x(i)|
so the error bound is the same as if all x(i) > 0, and the relative error
is
   | s*(n) - s(n) | / | s(n) | <= (n-1)*epsilon * [ sum |x(i)| / | sum x(i) | ]
Thus, if |s(n)| = | sum x(i) | << sum |x(i)|, i.e. if there is a lot of
"cancellation" in the sum, then there may be a large relative error.

Ex: fl((x + 1e16) - 1e16) = 0, not x, for any 0 < |x| < 1 (the spacing
between adjacent floating point numbers near 1e16 is 2), so the relative
error is 100%.

What about polynomial root finding, which involves summing and
multiplying? As we approach the root, we are in precisely the situation
where the value of the polynomial (a sum) is supposed to be near zero,
i.e. we expect large relative errors. The analysis is similar to summing.

Example 14: Horner's rule for polynomial evaluation (used in polyval)
computes p = sum from i=0 to n of c(i)*x^i by
   p = c(n); for i=n-1:-1:0, p = x*p + c(i); end
Using the same approach as for the rounding error analysis in Example 13,
we get
   p*(n) = c(n);
   for i=n-1:-1:0, p*(i) = ( x*p*(i+1)*(1+eps(i)) + c(i) )*(1+delta(i)); end
with |eps(i)| <= epsilon and |delta(i)| <= epsilon. Then
   p*(n)   = c(n)    ... no roundoff
   p*(n-1) = ( x*p*(n)*(1+eps(n-1)) + c(n-1) )*(1+delta(n-1))
           =   [x * c(n)] * (1+eps(n-1))*(1+delta(n-1))
             + c(n-1)     * (1+delta(n-1))
   p*(n-2) = ( x*p*(n-1)*(1+eps(n-2)) + c(n-2) )*(1+delta(n-2))
           =   [x^2 * c(n)]  * (1+eps(n-1))*(1+eps(n-2))*(1+delta(n-1))*(1+delta(n-2))
             + [x * c(n-1)]  * (1+eps(n-2))*(1+delta(n-1))*(1+delta(n-2))
             + c(n-2)        * (1+delta(n-2))
   ...
   p*(n-k) =   [x^k * c(n)]       * (1+eps(n-1))*...*(1+eps(n-k)) * (1+delta(n-1))*...*(1+delta(n-k))
             + [x^(k-1) * c(n-1)] * (1+eps(n-2))*...*(1+eps(n-k)) * (1+delta(n-1))*...*(1+delta(n-k))
             + [x^(k-2) * c(n-2)] * (1+eps(n-3))*...*(1+eps(n-k)) * (1+delta(n-2))*...*(1+delta(n-k))
             + ...
             + c(n-k)             * (1+delta(n-k))
Note that the maximum number of factors of the form (1+tiny) multiplied
together is 2*n. Thus the final computed value is
   p* = p*(0) = sum from i=0 to n of c*(i)*x^i
where
   c*(i) = c(i)*(1+r(i))   with |r(i)| <= 2*n*epsilon
That is, the computed value p* of the polynomial is the exact value of
another polynomial evaluated at the same point x, but with different
coefficients c*(i), each of which differs from the original c(i) by a
small relative change r(i).

Furthermore, we can write the error bound as
   | p* - p | = | sum from i=0 to n of c*(i)*x^i - c(i)*x^i |
              = | sum from i=0 to n of [c*(i) - c(i)]*x^i |
              = | sum from i=0 to n of r(i)*c(i)*x^i |
             <= sum from i=0 to n of |r(i)|*|c(i)|*|x|^i
                   (because r(i) can be positive or negative)
             <= 2*n*epsilon * sum from i=0 to n of |c(i)|*|x|^i
              = 2*n*epsilon * value of the polynomial with coefficients |c(i)|
                at |x|, i.e. all positive numbers, with no cancellation
In Matlab, if p = polyval(coef,x), then
   | p* - p | <= 2*n*epsilon * polyval(abs(coef),abs(x))
In practice, the factor 2*n is pessimistic and often omitted.
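Here is a minimal sketch of a routine computing this bound alongside the
value. The actual polyvalbnd.m used in the next example is not reproduced
in these notes, so this is only a guess at what it might contain:

   function [val, errorbnd] = polyvalbnd(coef, x)
   % Evaluate the polynomial with coefficients coef (Matlab polyval order,
   % leading coefficient first) at x by Horner's rule, and return the
   % roundoff error bound |p* - p| <= 2*n*epsilon*polyval(abs(coef),abs(x)).
   n = length(coef) - 1;                        % degree of the polynomial
   epsilon = 2^(-53);                           % unit roundoff for IEEE doubles
   val = polyval(coef, x);                      % Horner's rule evaluation
   errorbnd = 2*n*epsilon*polyval(abs(coef), abs(x));

Since polyval and abs work elementwise, this also works when x is a
vector, as in the example below.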
Example:
   [val,errorbnd]=polyvalbnd(coef,x)
   [x,val,errorbnd,errorbnd./val]   for x=(1:.2:3)'
   When the relative error bound errorbnd./val exceeds 1 in magnitude, we
   can't trust the sign of the computed value in a zero finder.

Example:
   [xout,HornerVal,TrueVal,AbsErrorBound,TrueAbsError]=polyplot(roots,x)
   [...]=polyplot(2*ones(13,1),1:.001:3);
   [...]=polyplot(1:20,-5:.001:25);

Using this error bound in a zero finder, like bisection (see the sketch at
the end of this section):
   1) compute both f(xmid) and an error bound on f(xmid)
   2) stop when the error bound > |f(xmid)|, because then the sign of
      f(xmid) is in doubt, and f(xmid) might as well be zero

End of floating point arithmetic details and error analysis, for now.

Big picture to remember: the value of a polynomial computed with roundoff
is the exact value of a slightly perturbed polynomial; this is an example
of a general phenomenon called "backward stability" that is widely used
to analyze the impact of roundoff on an algorithm.

(Warning: this is the first of several uses of the word "stability" in
this course. It has different meanings in different contexts, so be
aware!)
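Putting the pieces of this lecture together, here is a minimal sketch of a
bisection routine for a polynomial that uses the error bound as a stopping
criterion, as described above. This is not the code demonstrated in class;
the function name bisect_bnd, the particular combination of stopping
tests, and the variable names are all assumptions.

   function [xout, fx, iters] = bisect_bnd(coef, interval, reltol, abstol)
   % Bisection for the polynomial with coefficients coef (Matlab polyval
   % order).  Stops when the interval is small relative to its largest
   % endpoint (or smaller than abstol), or when the roundoff error bound
   % on f(xmid) exceeds |f(xmid)|, so the sign of f(xmid) is in doubt.
   n = length(coef) - 1;
   epsilon = 2^(-53);
   xlower = interval(1);  xupper = interval(2);
   flower = polyval(coef, xlower);
   iters = 0;
   while (xupper - xlower) > max(reltol*max(abs(xlower),abs(xupper)), abstol)
      xmid = (xlower + xupper)/2;
      fmid = polyval(coef, xmid);
      errbnd = 2*n*epsilon*polyval(abs(coef), abs(xmid));
      if errbnd >= abs(fmid)
         break     % sign of f(xmid) in doubt: xmid might as well be a zero
      end
      if sign(fmid)*sign(flower) <= 0    % sign change in [xlower, xmid]
         xupper = xmid;
      else                               % sign change in [xmid, xupper]
         xlower = xmid;  flower = fmid;
      end
      iters = iters + 1;
   end
   xout = (xlower + xupper)/2;
   fx = polyval(coef, xout);

For instance, [xout,fx,iters]=bisect_bnd(poly(2*ones(13,1)),[1.1,3.2],1e-10,0)
would apply it to the expanded form of (x-2)^13 from Examples 8-10.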