Notes for Math 221, Lecture 3, Sep 3 2009

Goals: Finish matrix and vector norms
       geometric interpretation of condition numbers
       sensitivity of solving Ax=b

Matrix and vector norms

   Recall: 
   Def norm : Let B be linear space R^n (or C^n).
       It is normed if there is a function ||.||:B -> R s.t.
         (1) ||x|| >= 0, and ||x|| = 0 iff x=0 (positive definiteness)
         (2) ||c*x|| = |c|*||x|| (homogeneity)
         (3) ||x+y|| <= ||x|| + ||y|| (triangle inequality)
      Ex: p-norms, inf-norm, C-norm  ||C*x|| where C nonsingular   

   Lemma (1.4): all norms are equivalent 
        i.e. given any two norms norm_a(x) and norm_b(x), there are positive constants
             alpha and beta such that 
             alpha*norm_a(x) <= norm_b(x) <= beta*norm_a(x) for all x
       (proof: compactness)

  This lemma is an excuse to use the easiest norm to prove later results.
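
    A quick numerical illustration (a minimal sketch in Python/NumPy; for the
    2- and inf-norms on R^n one can take alpha=1 and beta=sqrt(n)):

        import numpy as np
        n = 5
        x = np.random.randn(n)
        # ||x||_inf <= ||x||_2 <= sqrt(n)*||x||_inf for every x in R^n
        assert np.linalg.norm(x, np.inf) <= np.linalg.norm(x, 2) \
               <= np.sqrt(n) * np.linalg.norm(x, np.inf)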
        
   Def: matrix norm (vector norm on mxn vectors)
         (1) ||A|| >= 0, and ||A|| = 0 iff A=0 (positive definiteness)
         (2) ||c*A|| = |c|*||A|| (homogeneity)
         (3) ||A+B|| <= ||A|| + ||B|| (triangle inequality)

   Ex: max, Frobenius norms

   Def: operator norm
         norm(A) = max_x nonzero norm(A*x)/norm(x)

   Lemma (1.6): operator norm is a matrix norm 
      proof: homework Q 1.15

   Lemma (1.7): if norm(A) operator norm, there exists x such that 
          norm(x)=1 and norm(A*x)=norm(A).
     proof: norm(A) = max_x nonzero norm(A*x)/norm(x)
                    = max_x nonzero norm(A * x/norm(x) )
                    = max_y unit norm(A * y )
       y attaining this maximum exists since norm(A*y) is a continuous
       function of y on compact (closed and bounded) set = unit ball
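
    Numerical illustration (a sketch in NumPy; one maximizing x for the 2-norm
    is the top right singular vector of A, obtained here from np.linalg.svd):

        import numpy as np
        A = np.random.randn(4, 4)
        _, s, Vt = np.linalg.svd(A)
        x = Vt[0]                                   # unit vector
        # ||A x||_2 equals the operator 2-norm of A
        assert np.isclose(np.linalg.norm(A @ x), np.linalg.norm(A, 2))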

   Notation: Q^* = conj(Q^T), sometimes Q^H
   Def: orthogonal matrix: Q square, real and inv(Q) = Q^T
        unitary matrices:  Q square, complex, and inv(Q) = Q^*

    Fact: Q orthogonal <=> Q^T*Q = I <=> the columns of Q are mutually orthogonal
        unit vectors; Q*Q^T = I says the same about the rows

   Fact: norm(Q*x,2)=norm(x,2) - Pythagorean theorem
         norm(Q*x,2)^2 = (Q*x)^T*(Q*x) = x^T*Q^T*Q*x = x^T*x = norm(x,2)^2
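
    Quick check (sketch in NumPy; Q from a QR factorization of a random matrix
    is orthogonal):

        import numpy as np
        Q, _ = np.linalg.qr(np.random.randn(5, 5))  # orthogonal Q
        x = np.random.randn(5)
        assert np.allclose(Q.T @ Q, np.eye(5))      # Q^T*Q = I
        assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))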

   Lemma (most proofs in homework, Q1.16)
     norm(A*x) <= norm(A)*norm(x) for vector and its operator norm
     norm(A*B) <= norm(A)*norm(B) for operator norm
     norm(Q*A*Z,2) = norm(A,2) if Q, Z orthogonal
            (Pythagorean theorem - proof for Q only)
     norm(Q,2)=1
     norm(A,2) = sqrt(lambda_max(A^T*A))
   proof of last part only:
     norm(A,2) = max_x nonzero norm(A*x)/norm(x)
               = max_x nonzero sqrt( (A*x)^T*(A*x) )/ sqrt(x^T*x)
               = sqrt ( max_x nonzero (A*x)^T*(A*x) / (x^T*x) )
               = sqrt ( max_x nonzero x^T*A^T*A*x / x^T*x )

   Use fact that A^T*A is symmetric and so has eigenvalue decomposition
     A^T*A*q_i = l_i * q_i where l_i real, q_i real, unit, orthogonal
     thus l_i = l_i * q_i^T * q_i = q_i^T A^T A q_i = (A q_i)^T(A q_i) >= 0
     and A^T*A*Q = Q*Lambda where Q = [q_1,...,q_n], Lambda = diag(l_1,...,l_n)
     so A^T*A = Q*Lambda*Q^T and continuing from above

     norm(A,2) = sqrt ( max_x nonzero x^T*Q*Lambda*Q^T*x / x^T*x )
               = sqrt ( max_x nonzero x^T*Q*Lambda*Q^T*x / x^T*Q*Q^T*x )
               = sqrt ( max_y nonzero y^T*Lambda*y / y^T*y )
                    where y = Q^T*x
               = sqrt ( max_y nonzero sum_i y_i^2 l_i / sum_i y_i^2 )
              <= sqrt ( max_y nonzero l_max * sum_i y_i^2 / sum_i y_i^2 )
                = sqrt(l_max), attained by choosing y_i = 1 for the i with l_i = l_max, rest 0
     norm(A^T,2) = norm(A,2)
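
    Numerical check of the last two identities (sketch in NumPy;
    np.linalg.norm(A, 2) returns the operator 2-norm):

        import numpy as np
        A = np.random.randn(6, 4)
        lam = np.linalg.eigvalsh(A.T @ A)           # eigenvalues of A^T*A, all >= 0
        assert np.isclose(np.linalg.norm(A, 2), np.sqrt(lam.max()))
        assert np.isclose(np.linalg.norm(A.T, 2), np.linalg.norm(A, 2))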

Now for a common property of condition numbers that uses norms

   This is a geometric property of condition numbers that is
    true for many (and in an approximate sense all) of the
    condition numbers we encounter.

    Simple case: evaluating scalar function
        |f(x+dx) - f(x)| / |f(x)| ~ |f'(x)*x|/|f(x)| * |dx|/|x|
        rel change of output f(x) ~ cond(x) * rel change of input x
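
     For example, with f(x) = sqrt(x) we get cond(x) = |f'(x)*x/f(x)| = 1/2, so a
     relative change in x changes f(x) by about half as much (sketch in NumPy):

        import numpy as np
        x, dx = 2.0, 1e-8
        cond = abs(0.5/np.sqrt(x) * x / np.sqrt(x))           # |f'(x)*x/f(x)| = 1/2
        rel_out = abs(np.sqrt(x + dx) - np.sqrt(x)) / np.sqrt(x)
        rel_in  = abs(dx) / abs(x)
        assert np.isclose(rel_out, cond * rel_in, rtol=1e-4)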

   Def: A problem x  is "ill-posed" if cond(x) = inf
   ASK: for evaluating scalar function f(x), what is {ill-posed problems}?
        ANS: f(x) = 0, simple zero,  try f(x) = x-1 at 1
             f(x) = 0, multiple zero , try f(x) = (x-1)^2 at 1
             f(x) = inf, simple pole, try f(x) = 1/(x-1) at 1
    Suppose cond(x) is very large; the intuition is that x must be close
       to {ill-posed problems}.  How close?

       Newton's Method for simple root => if f(x) tiny enough,
          root ~ x - f(x)/f'(x)  so root - x ~ -f(x)/f'(x)
          where f(root) = 0, i.e. root in {ill-posed}
           so  rel dist to {ill-posed} ~ |(root - x)/x|
                                       ~ |-f(x)/(f'(x)*x)|
                                       = 1 / cond(x)

           or cond(x) ~ 1/ relative distance to nearest ill-posed problem

        Ex: if f(x) = prod_i (x-r(i)) then 
            cond(x) = |f'(x)*x/f(x)| = | sum_i  x/(x-r(i)) |
            so if x very close to one r(j), then 
               cond(x) ~ |x/(x-r(j))| = 1/relative distance from x to r(j)
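
         Numerical check with f(x) = (x-1)(x-2)(x-3), evaluated very close to the
         root at 2 (sketch in NumPy):

            import numpy as np
            r = np.array([1.0, 2.0, 3.0])
            x = 2.0 + 1e-6                          # very close to the root at 2
            f  = np.prod(x - r)
            fp = sum(np.prod(np.delete(x - r, i)) for i in range(len(r)))  # f'(x)
            cond = abs(fp * x / f)
            reldist = abs((x - 2.0) / x)            # relative distance to nearest root
            print(cond, 1.0/reldist)                # both about 2e6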
           
        Is also true for polynomial evaluation at fixed x: 
              what is the function? f(coefficients) = value of polynomial at x
                what is {ill-posed}? polynomials whose value at x is zero
              from earlier analysis in class, cond = sum_i |a(i)*x^i| / | sum_i a(i)*x^i |
                See Thm 1.2 in text for proof

        Is also true for matrix inversion: what is {ill-posed}?
              {singular matrices} - we'll prove this more carefully

Now start using this material to analyze matrix inversion

  Lemma: If operator norm(X)<1, then I-X is nonsingular and
          inv(I - X) = sum_{i>=0} X^i and
          norm(inv(I-X)) <= 1/(1-norm(X))
            proof: claim I + X + X^2 + ... converges:
               norm(X^i) <= norm(X)^i -> 0 as i increases
               so by equivalence of norms max_{kj} | (X^i)_kj | <= C* norm(X)^i
               and so kj-th entry of I + X + X^2 ... dominated by
               convergent geometric sequence and so converges
                   claim (I - X)*( I + X + X^2 + ... + X^i) 
                            = I - X^(i+1) -> I as i increases
               next norm(I + X + X^2 + ...)
                   <= norm(I) + norm(X) + norm(X^2) + ... by triangle
                   <= norm(I) + norm(X) + norm(X)^2 + ... by Lemma
                    =   1     + norm(X) + norm(X)^2 + ... by def of operator norm
                    =   1/(1 - norm(X))               ... geometric sum

       What does this say for 1x1 matrices?
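
        Numerical check (sketch in NumPy; X scaled so norm(X,2) = 0.5 < 1):

           import numpy as np
           n = 4
           X = np.random.randn(n, n)
           X *= 0.5 / np.linalg.norm(X, 2)          # now ||X||_2 = 0.5
           S = sum(np.linalg.matrix_power(X, i) for i in range(60))  # I + X + X^2 + ...
           assert np.allclose(S, np.linalg.inv(np.eye(n) - X))
           assert np.linalg.norm(np.linalg.inv(np.eye(n) - X), 2) \
                  <= 1/(1 - np.linalg.norm(X, 2)) + 1e-12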
              
   Lemma: Suppose A invertible. Then A-E invertible if norm(E) < 1/norm(inv(A)) 
          in which case
          inv(A-E) = Z + Z(EZ) + Z(EZ)^2 + Z(EZ)^3 + ... where Z=inv(A)
           and norm(inv(A-E)) <= norm(Z)/(1-norm(E)*norm(Z))
            proof: inv(A-E) = inv((I-E*Z)*A) = Z*inv(I-E*Z)
               exists if I-E*Z invertible, which holds if
               norm(E*Z) < 1, which holds if
               norm(E)*norm(Z) < 1, in which case
               inv(A-E) = Z*(I + EZ + (EZ)^2 + ...)
              take norms 

       What does this say for 1x1 matrices?
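
        Numerical check (sketch in NumPy; A is shifted to be comfortably
        nonsingular, and E is scaled so that norm(E)*norm(Z) < 1):

           import numpy as np
           n = 4
           A = np.random.randn(n, n) + 3*np.eye(n)  # assumed comfortably nonsingular
           Z = np.linalg.inv(A)
           nZ = np.linalg.norm(Z, 2)
           E = np.random.randn(n, n)
           E *= 0.1 / (np.linalg.norm(E, 2) * nZ)   # so norm(E,2)*norm(Z,2) = 0.1 < 1
           nE = np.linalg.norm(E, 2)
           S = sum(Z @ np.linalg.matrix_power(E @ Z, i) for i in range(60))
           assert np.allclose(S, np.linalg.inv(A - E))
           assert np.linalg.norm(np.linalg.inv(A - E), 2) <= nZ/(1 - nE*nZ) + 1e-12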

    Finally, we can ask how much inv(A) and inv(A-E) differ.
   Lemma: Suppose A invertible. If norm(E) < 1/norm(Z), then
          norm(inv(A-E)-Z) <= norm(Z)^2*norm(E)/(1-norm(E)*norm(Z))
           proof: inv(A-E)-Z = Z(EZ) + Z(EZ)^2 + ...
                             = ZEZ ( I + EZ + (EZ)^2 + ...)
                  and take norms

          relative change in inv(A) is
           norm(inv(A-E)-Z)/norm(Z)
          <= [ norm(A)*norm(Z) * 1/(1 - norm(E)*norm(Z)) ] * [ norm(E)/norm(A) ] 
            = [ norm(A)*norm(Z) ] * [ norm(E)/norm(A) ] + O(norm(E)^2)

       What does this say for 1x1 matrices?
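
        Numerical check of the bound (sketch in NumPy, same setup as above):

           import numpy as np
           n = 4
           A = np.random.randn(n, n) + 3*np.eye(n)  # assumed comfortably nonsingular
           Z = np.linalg.inv(A)
           nZ = np.linalg.norm(Z, 2)
           E = np.random.randn(n, n)
           E *= 1e-6 / (np.linalg.norm(E, 2) * nZ)  # so norm(E)*norm(Z) = 1e-6 < 1
           nE = np.linalg.norm(E, 2)
           lhs = np.linalg.norm(np.linalg.inv(A - E) - Z, 2)
           rhs = nZ**2 * nE / (1 - nE*nZ)
           assert lhs <= rhs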
         
    This justifies the following definition:
   DEF    condition number kappa(A) = norm(A)*norm(Z)
   Fact:  kappa(A) >= 1.
       proof: 1 = norm(I) = norm(A*Z) <= norm(A)*norm(Z)

    Theorem: min{norm(E)/norm(A): A+E singular} = rel. dist(A, {singular matrices})
                = 1/kappa(A)
       proof: lemma above shows inv(A+E) exists if norm(E) < 1/norm(Z), or
              norm(E)/norm(A) < 1/kappa(A),  so dist >= 1/kappa(A)
              Still need to show that there exists E such that A+E singular, 
              with norm 1/kappa(A). We'll just do the 2-norm:
               Need to make I-ZE singular, i.e. choose E so that
               norm(E)=1/norm(Z) and norm(ZE) is as large as possible.
               Pick a unit vector x with norm(Zx)=norm(Z) (exists by Lemma 1.7),
               let y=Zx/norm(Zx) (also a unit vector), and set E = x*y^T/norm(Z). Then
              (A-E)y = A(I-ZE)y = A(y-Zx*y^T*y/norm(Z)) = A(y-y)=0
              and norm(E) = norm(x*y^T)/norm(Z)
                          = norm(x)*norm(y)/norm(Z)
                          = 1*1/norm(Z)
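
        The same construction in NumPy (sketch, 2-norm only; one concrete
        choice of x is the top right singular vector of Z):

           import numpy as np
           n = 4
           A = np.random.randn(n, n)
           Z = np.linalg.inv(A)
           _, s, Vt = np.linalg.svd(Z)
           x = Vt[0]                                # unit x with ||Z x|| = ||Z||
           y = Z @ x / np.linalg.norm(Z @ x)
           E = np.outer(x, y) / np.linalg.norm(Z, 2)
           assert np.allclose((A - E) @ y, 0)       # A - E is singular
           assert np.isclose(np.linalg.norm(E, 2), 1/np.linalg.norm(Z, 2))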

    We've looked at the sensitivity of inv(A); now look at solving A*x=b.
    Consider A*x = b vs (A-E)*x' = b+f where x' = x+dx
                subtract to get
                    A*dx - E*x - E*dx = f
                    (A-E)dx = f + E*x
                    dx = inv(A-E)*(f + E*x)
                    norm(dx) = norm(inv(A-E)*(f + E*x))
                            <= norm(inv(A-E))*(norm(f) + norm(E)*norm(x))
                            <= norm(Z)/(1-norm(E)*norm(Z))*(norm(f) + norm(E)*norm(x))
                    norm(dx)/norm(x)
                             <= norm(Z)*norm(A) * 1/(1-norm(E)*norm(Z))
                               *( norm(f)/(norm(A)*norm(x)) + norm(E)/norm(A) )
                             <= norm(Z)*norm(A) * 1/(1-norm(E)*norm(Z))
                               *( norm(f)/norm(b) + norm(E)/norm(A) )
                                  ... using norm(b) = norm(A*x) <= norm(A)*norm(x)

                    relerr in x <= kappa(A) * (factor close to 1, unless A nearly singular)
                                            * (rel change in b + rel change in A)
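
        Numerical check (sketch in NumPy; the printed relative error should not
        exceed, and is typically far below, the printed bound):

           import numpy as np
           n = 4
           A = np.random.randn(n, n) + 3*np.eye(n)
           x = np.random.randn(n)
           b = A @ x
           E = 1e-8 * np.random.randn(n, n)
           f = 1e-8 * np.random.randn(n)
           xprime = np.linalg.solve(A - E, b + f)   # perturbed problem
           relerr = np.linalg.norm(xprime - x) / np.linalg.norm(x)
           kappa  = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
           bound  = kappa * (np.linalg.norm(f)/np.linalg.norm(b)
                             + np.linalg.norm(E, 2)/np.linalg.norm(A, 2))
           print(relerr, bound)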

     Our algorithms will attempt to guarantee that 
      (*)   computed solution of A*x=b is (A-E)*x' = b+f where 
            norm(f)/norm(b) = O(macheps) and norm(E)/norm(A) = O(macheps)
         so relerr in x ~ O( kappa(A)*macheps )
        Recall that property (*) is called "backward stability" 
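
      In practice this is easy to observe with np.linalg.solve (LU with partial
      pivoting via LAPACK); a sketch:

         import numpy as np
         n = 100
         A = np.random.randn(n, n)
         x = np.random.randn(n)
         b = A @ x
         xhat = np.linalg.solve(A, b)
         relerr = np.linalg.norm(xhat - x) / np.linalg.norm(x)
         print(relerr, np.linalg.cond(A, 2) * np.finfo(float).eps)  # relerr = O(kappa*macheps)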

       Another practical approach: given x', how accurate a solution is it?
       compute residual r = A*x'-b = A*x'-A*x = A*(x'-x) = A*error
       so error = inv(A)*r and norm(error) <= norm(inv(A))*norm(r)
       norm(r) also determines the backward error in A
    Theorem: The smallest E such that (A+E)x' = b has norm(E) = norm(r)/norm(x')
      proof: r = A*x' - b = -E*x' so norm(r) <= norm(E)*norm(x')
             to attain lower bound on norm(E), choose E = -r*x'^T/norm(x')^2, in 2 norm
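
       Checking the construction (sketch in NumPy, 2-norm):

          import numpy as np
          n = 4
          A = np.random.randn(n, n)
          b = np.random.randn(n)
          xprime = np.linalg.solve(A, b) + 1e-6*np.random.randn(n)  # approximate solution
          r = A @ xprime - b
          E = -np.outer(r, xprime) / np.linalg.norm(xprime)**2
          assert np.allclose((A + E) @ xprime, b)   # x' solves the perturbed system exactly
          assert np.isclose(np.linalg.norm(E, 2), np.linalg.norm(r)/np.linalg.norm(xprime))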

    All our practical error bounds depend on norm(inv(A)). Actually computing inv(A)
    costs several times as much as just solving A*x=b, so we will use cheap
    estimates of norm(inv(A)) that avoid computing inv(A) explicitly and
    work with small probability of large error.
    In fact there is a theorem (D., Diament, Malajovich, 2000) that says that
    estimating kappa(A) even roughly (but still with some guarantee, say "to within
    a factor of 10", or within a factor of any constant) is about as expensive as
    multiplying matrices, which in turn is about as expensive as inverting them.