Notes for Ma221, Lecture 2, Sep 1 2009

Goals: Floating point arithmetic
       Roundoff error analysis with
          problem of evaluating a polynomial
       Matrix and vector norms

Example: Recall plot of  (x-2)^13, evaluated 2 ways:
       as (x-2)^13 - smooth, monotonic curve, as expected
       as x^13 - 26*x^12 + ... - 8192, with Horner's rule:
            for x in [1.8,2.3], basically random numbers
            with values in roughly [-1e-8,1e-8]
       why the big difference? need to understand floating point arithmetic

Floating Point - how real numbers are represented in the computer

   Scientific Notation:
      +- d.ddd x radix^e
   Floating point usually radix 2 (10 sometimes, 16 historically)
       so one stores a sign bit (+-), exponent (e), and mantissa (d.ddd)
       # of digits and exponent range are limited, to fit in 32 bits, 64 bits, etc.
    Normalized:  use 3.1000e0 not  0.0031e3 (leading digit nonzero)
       normalization makes the representation unique
   Def: rnd(x) = nearest floating point number to x 
   Relative Representation Error (RRE):  | x - rnd(x) | / | rnd(x) |
   Maximum Relative Representation Error  = max_x  RRE(x)
    aka machine epsilon, macheps, 
        = half the distance from 1 to the next larger number, 1+radix^(1-p)
        = .5 * radix^(1-p)   (attained at x = 1+.5*radix^(1-p), which rounds to 1)
        = 2^(-p) in binary, where p = # digits in the mantissa
    Note: eps in Matlab = 2*macheps
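    Quick Matlab check of this (a sketch; assumes IEEE double and round to
       nearest, ties to even):
         eps                % 2.2204e-16 = 2^(-52), gap from 1 to next double
         (1 + eps/2) == 1   % true: 1 + 2^(-53) is a tie, rounds back to 1
         (1 + eps) > 1      % true: 1 + 2^(-52) is the next representable number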
   Roundoff error model, provided no over/underflow
      fl(a op b) = (a op b)(1 + delta),   |delta| <= macheps
         where op may be add, subtract, multiply or divide
       Used throughout the course; all you need to analyze most algorithms
       also true for complex arithmetic, with a somewhat larger macheps
          (see Q 1.12 for details)
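       To see the model in action (a sketch, not from the notes): do one
       operation in single, treat double as exact, and check |delta| <= 2^(-24):
          a = single(2)/single(3);           % fl(2/3) in single
          delta = (double(a) - 2/3)/(2/3);   % relative error of that one op
          abs(delta) <= 2^(-24)              % true: within single's macheps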
   IEEE single(S)/double(D)/double extended(E)/quad (Q) formats : radix = 2
      S: 1+8+23 bits, -126 <= exp <= 127, p = 24, MRRE = 2^(-24) ~ 6e-8
           overflow threshold (OV) ~  2^128 ~ 1e38, 
           underflow threshold (UN) = 2^(-126) ~ 1e-38
       D: 1+11+52 bits, -1022 <= exp <= 1023, p=53, MRRE = 2^(-53) ~ 1e-16
           OV ~  2^1024 ~ 1e308, 
           UN = 2^(-1022) ~ 1e-308
      E: 1+15+64 bits, -16382 <= exp <= 16383, p=64, MRRE = 2^(-64) ~ 1e-19
            OV ~  2^16384 ~ 1e4932, 
            UN = 2^(-16382) ~ 1e-4932
       Q: 1+15+112 bits, -16382 <= exp <= 16383, p=113, MRRE = 2^(-113) ~ 1e-34
            OV ~  2^16384 ~ 1e4932,
            UN = 2^(-16382) ~ 1e-4932
    Note: S and D available in hardware on most platforms,
         E on Intel x86 architectures
         Q usually in software, if at all
         higher available via software simulation
           (see ARPREC, GMP on web page)
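    Quick check of the S and D rows above in Matlab (realmax, realmin,
          eps are built-in):
          realmax            % 1.7977e308, largest double, just below 2^1024
          realmin            % 2.2251e-308 = 2^(-1022), UN for double
          eps                % 2.2204e-16 = 2^(-52) = 2*macheps for double
          realmax('single')  % 3.4028e38, just below 2^128
          realmin('single')  % 1.1755e-38 = 2^(-126), UN for single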

That's enough for now to understand the plot of (x-2)^13, but more 
about floating point later.

  Analyze Horner for p(x)
     full expression
            p = sum_{i=0 to d} a_i x^i
             p   = a_d, for i=d-1:-1:0, p   = x*p       + a_i
             p_d = a_d, for i=d-1:-1:0, p_i = x*p_{i+1} + a_i
          with roundoff terms d_i (multiply) and d'_i (add), |d_i|,|d'_i| <= macheps:
             p_d = a_d, for i=d-1:-1:0, p_i = [x*p_{i+1}*(1+d_i) + a_i]*(1+d'_i)
          unrolling the recurrence down to p_0:
             p_0 = sum_{i=0:d-1} [(1+d'_i) * prod_{j=0:i-1} (1+d_j)(1+d'_j)] a_i x^i
                   + [prod_{j=0:d-1} (1+d_j)(1+d'_j)] a_d x^d
                 = sum_{i=0:d-1} [product of 2i+1 terms like (1+d)] a_i x^i
                   + [product of 2d terms like (1+d)] a_d x^d

      In words: Horner is backward stable: it returns the exact value at x
         of a polynomial q with slightly changed coefficients

     how to simplify to get error bound
           prod_{i=1:n} (1 + delta_i)
              <= prod_{i=1:n} (1 + eps)
               = 1 + n*eps + O(eps^2)  ...  usually ignore (eps^2)
              <= 1 + n*eps/(1-n*eps)  if n*eps < 1 ... (lemma left to students)
       similarly
           prod_{i=1:n} (1 + delta_i)
              >= prod_{i=1:n} (1 - eps)
               = 1 - n*eps + O(eps^2)  ...  usually ignore (eps^2)
              >= 1 - n*eps/(1-n*eps)  if n*eps < 1 ... (lemma left to students)
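        quick numeric sanity check in Matlab (a sketch with random deltas;
        recall eps in Matlab = 2*macheps):
            n = 10; delta = (eps/2)*(2*rand(n,1)-1); % random |delta_i| <= macheps
            abs(prod(1+delta) - 1) <= n*eps          % true; bound uses n*eps =
                                                     % 2*n*macheps, leaving slack
                                                     % for rounding in prod itself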

        so  |prod_{i=1:n} (1 + delta_i) - 1| <= n*eps ... ignoring eps^2 terms
       and thus
            |computed p_0 - p(x)| <= sum_{i=0:d-1} (2i+1)*eps*|a_i x^i|
                                    + 2*d*eps*|a_d x^d|
                                 <= sum_{i=0:d} 2*d*eps*|a_i x^i|

            relerr = |computed p_0 - p(x)|/|p(x)|
                 <=  sum_{i=0:d} |a_i x^i| / |p(x)|   *   2*d*eps
                  =  condition number  *  relative backward error

           How many decimal digits can we trust?
                k correct digits <=> relative error ~ 10^(-k)
                                 <=> k ~ -log_10 (relative error)

    how to modify Horner to compute error bound
             p = a_d, ebnd = |a_d|
             for i=d-1:-1:0, p = x*p + a_i, ebnd = |x|*ebnd + |a_i|
             ebnd = 2*d*macheps*ebnd
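     a runnable Matlab version (hornerbnd is my name, not from the notes;
       save as hornerbnd.m; coefficients ordered as in polyval, highest first):
        function [p,ebnd] = hornerbnd(a,x)
        % evaluate p(x) = a(1)*x^d + ... + a(d+1) by Horner's rule,
        % along with the running error bound derived above
        d = length(a) - 1;
        p = a(1); ebnd = abs(a(1));
        for i = 2:d+1
          p    = x.*p + a(i);               % Horner step
          ebnd = abs(x).*ebnd + abs(a(i));  % same recurrence on |a_i|, |x|
        end
        ebnd = 2*d*(eps/2)*ebnd;            % 2*d*macheps * sum_i |a_i x^i|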

     Matlab demo:
       coeff = poly(2*ones(13,1));
       x = [1.6:1e-4:2.4];
       y = polyval(coeff,x);
       yacc = (x-2).^13;
       ebnd = 13*eps*polyval(abs(coeff),abs(x));
       %   note eps in matlab = 2*macheps
       plot(x,y,'k.',x,yacc,'c',x,y-ebnd,'r',x,y+ebnd,'r')
       axis([1.65 2.35 -1e-8 1e-8]), grid
       %   need to make vertical axis wider to see bounds
       axis([1.65 2.35 -1e-6 1e-6]), grid
        %   conclusion: can't trust the sign of y for x roughly inside [1.72, 2.33]

     Consider Question 1.21: how could we use this error bound to 
       stop iterating in root finding using bisection?
       ... now try wider range, look at actual and estimated # correct digits

       x = -1:.0001:5;
       y = polyval(coeff,x);
       yacc=(x-2).^13;
       ebnd=13*eps*polyval(abs(coeff),abs(x));
       plot(x,-log10(abs((y-yacc)./yacc)),'k.',x,-log10(ebnd./abs(yacc)),'r')
       axis([-1 5 0 16]), grid
       title('Number of correct digits in y')

   Notice the analogy between Horner's rule and computing dot-products:

      p   = a_d, for i=d-1:-1:0, p   = x*p     + a_i
     s   =  0 , for i=1:d,      s   = x_i*y_i + s

   analysis very similar (homework 1.10)
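    the same running error bound idea works for dot products (a sketch,
    not from the notes; 1.10 asks you to derive the actual bound):
      d = 5; x = randn(d,1); y = randn(d,1);  % example data
      s = 0; ebnd = 0;
      for i = 1:d
        s    = s + x(i)*y(i);
        ebnd = ebnd + abs(x(i)*y(i));
      end
      ebnd = d*eps*ebnd;  % |computed s - s| <= 2*d*macheps*sum_i |x_i*y_i|, roughly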

   More about floating point:
    what happens when result too big, too small to fit, or 
    isn't mathematically defined, like 0/0? 
    get exceptions
 
     What if 0 < |answer| < UN? underflow
       What to do: if you simply return zero, what happens with this code?
           ... suppose a, b nonnegative
           if (a .ne. b) then x = max(a,b)/(a-b)
           ... if a-b underflows and is flushed to zero, this divides by zero
       How big can the correct answer be? |max(a,b)/(a-b)| < 1/macheps
      Instead, get "subnormal numbers"
         x = +- 0.dddd  * 2^exp_min instead of +- 1.dddd * 2^exp
      Draw IEEE number line
      Without subnormals, do error analysis with
          fl(a op b) = (a op b)(1 + delta) + eta,   |delta| <= macheps, 
                    |eta| <= UN
      With subnormals, do error analysis with
          fl(a op b) = (a op b)(1 + delta) + eta,   |delta| <= macheps, 
                |eta| <= UN*macheps for *,/  and eta=0 for +-
       i.e. the underflow term eta = UN*macheps for each format:
         S: + eta, |eta| <= 2^(-150) ~ 1e-45
         D: + eta, |eta| <= 2^(-1075) ~ 1e-324
         E: + eta, |eta| <= 2^(-16446) ~ 1e-4951
        Thm: With subnormals, for all floats a,b fl(a-b) = 0 iff a=b
        Purpose: simplify reasoning about floating point algorithms
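         Matlab demo (assumes IEEE double):
           realmin*eps       % 4.9407e-324 = 2^(-1074), smallest subnormal
           realmin/2         % 1.1125e-308, subnormal, not flushed to zero
           x = 1.5*realmin; y = realmin;
           x - y             % realmin/2, nonzero since x ~= y, as the Thm promises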

      What if answer > OV? get infinity (overflow)
        fl(1e20*1e20) = fl(1/0) = inf    ... in single
        fl(1/inf) = 0, fl(1/-0) = -inf

     What if answer mathematically undefined? get NaN = Not a Number 
        fl(sqrt(-1)) = fl(0/0) = fl(inf-inf) = NaN
        3 + NaN = NaN ... so you see NaN on output if one occurred
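         Matlab demo (each line follows the IEEE rules above):
           single(1e20)*single(1e20)  % Inf: 1e40 > OV ~ 1e38 in single
           1/Inf                      % 0
           1/-0                       % -Inf
           0/0, Inf-Inf               % NaN both times
           3 + NaN                    % NaN: NaNs propagate to the output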

      Flags to check if an exception occurred

    Why bother with exceptions? Why not always just stop when one occurs?
     (1) reliability: too hard to put a test before each
          floating point operation to avoid exceptions
          Ex Matlab: you don't want to lose your work just because you did 0/0
          Ex control systems: crashing could be catastrophic
             (see Ariane 5 crash on web page)
     (2) speed: ditto; the pattern is
           (a) run "reckless" code that is fast but assumes no exceptions
           (b) check exception flags
           (c) in case of exception, rerun with a slower, safer algorithm
         Ex: s = root-sum-squares = sqrt(sum_{i=1}^n x_i^2)
         What could go wrong? (see Q 1.19; one fix is sketched after this list)
    (3) sometimes one can prove code correct with exceptions:
        current fastest routine to find (some) eigenvalues of 
            symmetric matrices depends on these rules, including 1/-0 = -inf
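         a Matlab sketch of the reckless/careful pattern for root-sum-squares
         (one standard fix, via scaling; Q 1.19 asks for the analysis):
           x = [1e200 1e200];                      % example: x.^2 overflows
           s = sqrt(sum(x.^2));                    % (a) fast "reckless" version
           if isinf(s) || (s == 0 && any(x ~= 0))  % (b) did it over/underflow?
             m = max(abs(x));                      % (c) rescale so max entry is 1
             s = m*sqrt(sum((x/m).^2));            %     safe: returns 1.4142e200
           end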

   Possible Class Project: debug recent LAPACK code we sent to MathWorks

   Note: IEEE 754 standard, led by Prof. Kahan, who got Turing Award
         for his efforts, revised in 2008, see pointers on class web page

   To generalize error analysis to linear algebra, need to
   generalize absolute value to vectors and matrices, so need norms

Matrix and vector norms

   Def norm : Let B be linear space R^n (or C^n).
       It is normed if there is a function ||.||:B -> R s.t.
         (1) ||x|| >= 0, and ||x|| = 0 iff x=0 (positive definiteness)
         (2) ||c*x|| = |c|*||x|| (homogeneity)
         (3) ||x+y|| <= ||x|| + ||y|| (triangle inequality)
   Examples (1, 2, infinity, 1 < p < inf, norm(C*x) for C nonsingular (why? Q 1.5))
	pictures of unit balls
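    quick Matlab examples (C below is just some nonsingular matrix):
      x = [3; -4];
      norm(x,1), norm(x,2), norm(x,inf)  % 7, 5, 4
      C = [2 1; 0 1]; norm(C*x)          % norm(C*x,2) = sqrt(20), also a norm of x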

   Def inner product: B as above. < .,. >: BxB -> R (or C) is inner product if
         (1) < x,y > = < y,x > (or conj(< y,x >) ) (commutativity)
         (2) < x+z,y > = < x,y > + < z,y > (linearity)
         (3) < c*x,y > = c*< x,y >
         (4) < x,x > >= 0 and < x,x >=0 iff x=0
 
   Example (dot product)(real or complex)
   Def x,y orthogonal if < x,y >=0
    Lemma (1.1): Cauchy-Schwarz inequality: |< x,y >| <= sqrt(< x,x >)*sqrt(< y,y >)
     proof: should know it!
   Lemma (1.2): sqrt(< x,x >) is a norm.
     Only prove if students want to see it
      only issue is the triangle inequality:
       norm(x+y) <= norm(x) + norm(y) 
     if
       norm(x+y)^2 <= (norm(x) + norm(y))^2
     if
       < x+y,x+y > <= < x,x > + 2sqrt(< x,x >)sqrt(< y,y >) + < y,y >
     if 
       < x,x > + < x,y > + < y,x > + < y,y > <= < x,x > + 2sqrt(< x,x >)sqrt(< y,y >) + < y,y >
     if 
       2 Re < x,y > <= 2 sqrt(< x,x >)sqrt(< y,y >)
     if |< x,y >| <= sqrt(< x,x >)sqrt(< y,y >)
        which follows from Cauchy-Schwarz

   Def (1.6). A is symmetric positive definite (spd) if A=A^T and x^T*A*x > 0 for all nonzero x
         A is symmetric positive semidefinite if A=A^T and x^T*A*x >= 0 for all x

   Lemma (1.3): < y,x > is an inner product if and only if there exists an spd A
          such that < y,x > = y^T*A*x 
     proof that A s.p.d. => y^T*A*x an inner product:
           (1)-(3) easy, (4) follows from def of spd
     proof that < x,y > inner product => there is an spd A s.t. < x,y > = y^T*A*x
           (homework: Q 1.13)
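      quick Matlab illustration (A below is one arbitrary spd example):
        n = 4; B = randn(n); A = B'*B + eye(n);  % spd by construction
        x = randn(n,1); y = randn(n,1);
        ip = y'*A*x;   % an inner product, by Lemma 1.3
        x'*A*x > 0     % true for any nonzero x, by Def (1.6)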

   Lemma (1.4): all norms are equivalent, 
        i.e. given any norm_a(x) and norm_b(x), there are positive constants
             alpha and beta such that 
             alpha*norm_a(x) <= norm_b(x) <= beta*norm_a(x)
       (proof: compactness)
        
   Lemma (1.5): Equivalences for 2-norm, 1-norm
           norm(x,2) <= norm(x,1) <= sqrt(n)*norm(x,2)
            (picture proof)
         (other examples: homework : Q 1.14)
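          quick numeric check in Matlab (any x will do):
            n = 7; x = randn(n,1);
            norm(x,2) <= norm(x,1)            % true
            norm(x,1) <= sqrt(n)*norm(x,2)    % true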