Notes for Math 221, Lecture 3, Sep 2 2010

Goals: Matrix and vector norms
       Singular Value Decomposition
       Condition numbers for Ax=b

Matrix and vector norms

   Def norm: Let B be a linear space, e.g. R^n (or C^n).
       It is normed if there is a function ||.||:B -> R s.t.
         (1) ||x|| >= 0, and ||x|| = 0 iff x=0 (positive definiteness)
         (2) ||c*x|| = |c|*||x|| (homogeneity)
         (3) ||x+y|| <= ||x|| + ||y|| (triangle inequality)
      Ex: p-norm  = (sum_i |x_i|^p)^(1/p), for p >= 1
          Euclidean norm = 2-norm 
            Note: x real => ||x||_2^2 = sum_i x_i^2 = x^T * x
          infinity-norm = max_i |x_i|
          C-norm:  ||x||_C = ||C*x||, where C is nonsingular and ||.|| is any norm
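
   (A quick matlab check of these examples, as a sketch one can run:
      x = randn(5,1);
      norm(x,1)                 % 1-norm: sum_i |x_i|
      norm(x,2)                 % 2-norm (Euclidean)
      norm(x,inf)               % infinity-norm: max_i |x_i|
      abs(norm(x,2)^2 - x'*x)   % checks ||x||_2^2 = x^T*x; ~ macheps
    )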

   Lemma (1.4): all norms are equivalent 
       i.e. given any norm_a(x) and norm_b(x), there are positive constants
            alpha and beta such that 
            alpha*norm_a(x) <= norm_b(x) <= beta*norm_a(x) for all x
       (proof: compactness)

  This lemma is an excuse to use the easiest norm to prove later results.
        
   Def: matrix norm (vector norm on mxn vectors)
         (1) ||A|| >= 0, and ||A|| = 0 iff A=0 (positive definiteness)
         (2) ||c*A|| = |c|*||A|| (homogeneity)
         (3) ||A+B|| <= ||A|| + ||B|| (triangle inequality)

   Ex: max norm = max_ij |A_ij| 
       Frobenius norm = (sum_ij |A_ij|^2)^(1/2)

   Def: operator norm (induced by vector norms on the domain and range,
        which may differ):
         norm(A) = max_{nonzero x} norm(A*x)/norm(x)

   Lemma (1.6): operator norm is a matrix norm 
      proof: homework Q 1.15

   Lemma (1.7): if norm(A) operator norm, there exists x such that 
          norm(x)=1 and norm(A*x)=norm(A).
     proof: norm(A) = max_{nonzero x} norm(A*x)/norm(x)
                    = max_{nonzero x} norm(A * x/norm(x) )
                    = max_{unit vector y} norm(A * y )
       y attaining this maximum exists since norm(A*y) is a continuous
       function of y on compact (closed and bounded) set = unit ball

   Notation: Q^* = conj(Q^T), sometimes Q^H
   Def: orthogonal matrix: Q square, real and inv(Q) = Q^T
        unitary matrices:  Q square, complex, and inv(Q) = Q^*
        (for simplicity, state results for real case, but all extend to complex).

   Fact: Q orthogonal <=> Q^T*Q = I <=> the columns of Q are mutually orthogonal
       unit vectors; Q*Q^T = I says the same about the rows

   Fact: norm(Q*x,2)=norm(x,2) - aka Pythagorean theorem
         Proof: norm(Q*x,2)^2 = (Q*x)^T*(Q*x) = x^T*Q^T*Q*x = x^T*x = norm(x,2)^2

   Fact: Q and Z orthogonal => Q*Z orthogonal
         Proof: (Q*Z)^T * (Q*Z) = Z^T*Q^T * Q*Z = Z^T * I * Z = I
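
   (A quick matlab check of these facts; qr of a random matrix is used here just
    to produce an orthogonal Q:
      [Q,R] = qr(randn(5));        % Q is a 5x5 orthogonal matrix
      norm(Q'*Q - eye(5))          % ~ macheps: columns are orthonormal
      x = randn(5,1);
      abs(norm(Q*x) - norm(x))     % ~ macheps: 2-norm preserved
      [Z,R] = qr(randn(5));
      norm((Q*Z)'*(Q*Z) - eye(5))  % ~ macheps: product of orthogonal matrices is orthogonal
    )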

   Fact: norm(A,2) = sqrt(lambda_max(A^T*A)), where lambda_max denotes the
     largest eigenvalue of A^T*A.
     Proof: norm(A,2) = sqrt( max_{nonzero x} x^T*A^T*A*x / x^T*x )
       A^T*A is symmetric, so it has real eigenvalues l_1,...,l_n and
       orthonormal eigenvectors q_1,...,q_n; it is also positive semidefinite,
       since x^T*(A^T*A)*x = norm(A*x,2)^2 >= 0, so each l_i >= 0,
       and A^T*A*Q = Q*Lambda where Q = [q_1,...,q_n], Lambda = diag(l_1,...,l_n),
       so A^T*A = Q*Lambda*Q^T and continuing from above
     norm(A,2) = sqrt ( max_{nonzero x} x^T*Q*Lambda*Q^T*x / x^T*x )
               = sqrt ( max_{nonzero x} x^T*Q*Lambda*Q^T*x / x^T*Q*Q^T*x )
               = sqrt ( max_{nonzero y} y^T*Lambda*y / y^T*y )
                    where y = Q^T*x
               = sqrt ( max_{nonzero y} sum_i y_i^2 l_i / sum_i y_i^2 )
               <= sqrt ( max_{nonzero y} l_max * sum_i y_i^2 / sum_i y_i^2 )
                = sqrt(l_max), attained by choosing y with the entry corresponding
                  to l_max equal to 1 and the rest 0
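
   (A matlab check of this fact, as a sketch:
      A = randn(4,3);
      norm(A,2)                    % matlab's 2-norm of A
      sqrt(max(eig(A'*A)))         % sqrt(lambda_max(A^T*A)); agrees to roundoff
    )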

SVD = Singular Value Decomposition
      History: the first complete statement goes back to Eckart & Young in 1936;
      the first reliable algorithm is due to Golub & Kahan in 1965, with faster ones since,
      to be discussed in Chapter 5; perhaps the best algorithm appears in the
      2010 PhD thesis of Paul Willems, still to be incorporated into LAPACK  
      (learning about and testing this algorithm is a possible class project)

   Thm. Suppose A is m x m. Then there is an
           orthogonal matrix U = [u(1),...,u(m)] 
           diagonal matrix Sigma = diag(sigma(1),...,sigma(m)) 
                  with sigma(1) >= sigma(2) >= ... >= 0
           orthogonal matrix V = [v(1),...,v(m)] 
        with A = U*Sigma*V^T. 
           u(i) called left singular vectors
           sigma(i) called singular values
           v(i) called right singular vectors
        More generally, if A is m x n with m > n, then
           U m x m and orthogonal as before
           V n x n and orthogonal 
           Sigma is m x n with same diagonal as before
        When m > n, we sometimes write this more compactly as 
           A = [u(1),...,u(n)] * diag(sigma(1),...,sigma(n)) * V^T  (the "thin" SVD)

   Geometric interpretation: Thinking of A as a linear mapping from R^n
     to R^m, with the right orthogonal choice of bases 
     of R^n (i.e. columns of V) and R^m (i.e. columns of U) 
     then A is diagonal (i.e. Sigma):  
     A = U*Sigma*V^T => A*V = U*Sigma => A*v(i) = sigma(i)*u(i)
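
   (A matlab check of A*v(i) = sigma(i)*u(i), as a sketch:
      A = randn(5,3);
      [U,S,V] = svd(A);                % full SVD: U is 5x5, S is 5x3, V is 3x3
      norm(A*V - U*S)                  % ~ macheps: A*V = U*Sigma
      norm(A*V(:,1) - S(1,1)*U(:,1))   % ~ macheps: first singular pair
    )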

   Proof: Induction on n:
          Two base cases
               n=1: Let U have the first column = A/norm(A,2), rest chosen
                    any way that makes U orthogonal; Sigma = norm(A,2), V = 1
               A=0: Let U = I_m,  Sigma = 0, V = I_n
          Induction step (if A nonzero):
             ||A||_2 = max_x ||A*x||_2 / ||x||_2
                     = max_{x: ||x||_2 = 1} ||A*x||_2
              Let v(1) be x attaining the max, sigma(1) = ||A||_2,
              and u(1) = A*v(1) / ||A*v(1)||_2 .
          Choose V = [v(1),Vhat] to be square & orthogonal
          Choose U = [u(1),Uhat] to be square & orthogonal
          Write Ahat = U^T*A*V
                  = [ u(1)^T*A*v(1)   u(1)^T*A*Vhat ]
                    [ Uhat^T*A*v(1)   Uhat^T*A*Vhat ]
                 = [ sigma(1)       A12 ]
                   [ A21            A22 ]
           Show A21 = 0 by def of Uhat: A21 = Uhat^T*A*v(1) = sigma(1)*Uhat^T*u(1) = 0,
               since the columns of Uhat are orthogonal to u(1)
           Show A12 = 0 by def of sigma(1) = ||A||_2 (otherwise ||Ahat||_2 > sigma(1))
          Apply induction to A22 = U_2*Sigma_2*V_2^T
          so A = U*Ahat*V^T = U * [ sigma(1)        0          ] * V^T
                                  [    0     U_2*Sigma_2*V_2^T ]
               = U * [ 1  0  ] * [ sigma(1)    0    ] * [ 1  0    ] * V^T
                     [ 0 U_2 ]   [   0      Sigma_2 ]   [ 0 V_2^T ]
               = orthogonal * nonnegative_diagonal * orthogonal, as desired

   Thm: The SVD has many useful properties (assume A m x n with m >= n)
      Fact 1: In the square nonsingular case, 
              we can use it to solve A*x=b with O(n^2) more work:
               Proof: x = inv(A)*b = inv(U*Sigma*V^T)*b = (V*inv(Sigma)*U^T)*b
              But note: computing the SVD itself is expensive, O(n^3) (see Chap 5)
              If all you want to do is solve A*x=b, Gaussian Elimination is cheaper.
              On the other hand, we will see that the SVD is more reliable when
              A is nearly singular, and even lets us "solve" A*x=b when A
              is exactly singular.
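
               (A sketch in matlab: given the SVD, the solve itself is only O(n^2) work:
                  n = 100; A = randn(n); b = randn(n,1);
                  [U,S,V] = svd(A);             % the expensive O(n^3) part
                  x = V*(diag(S).\(U'*b));      % apply U^T, divide by the sigma(i), apply V
                  norm(A*x-b)/norm(b)           % small, unless A is nearly singular
                )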

      Fact 2: When m>n, we can use it to solve the full rank least squares problem
               argmin_x ||A*x-b||_2 as follows: writing the thin SVD A = U*Sigma*V^T
               with U m x n, then x = V*inv(Sigma)*U^T*b, the same formula as in
               the square case.
               Proof: Write A = Uhat*Sigmahat*V^T where Uhat = [U,U'] is m x m and
                      orthogonal (U' holds the remaining m-n columns) and
                      Sigmahat = [Sigma; 0] is m x n. Then
                     || A*x-b ||_2^2 = || Uhat*Sigmahat*V^T*x - b ||_2^2
                                     = || Uhat^T* ( " )  ||_2^2
                                     = || Sigmahat*V^T*x - Uhat^T*b ||_2^2
                                      = || [ Sigma*V^T*x - U^T*b ] ||_2^2
                                        || [       -U'^T*b       ] ||
                                      = || Sigma*V^T*x - U^T*b ||_2^2 + || U'^T*b ||_2^2
                     is clearly minimized by choosing x = V*inv(Sigma)*U^T*b
                     to zero out the term depending on x
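
              (A sketch in matlab, using the thin SVD and comparing with backslash:
                 m = 10; n = 4; A = randn(m,n); b = randn(m,1);
                 [U,S,V] = svd(A,0);           % thin SVD: U is m x n, S is n x n
                 x = V*(diag(S).\(U'*b));      % x = V*inv(Sigma)*U^T*b
                 norm(x - A\b)                 % ~ macheps: agrees with least squares via \
               )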

       Def: When A is m x n with m >= n and full rank, with thin SVD A = U*Sigma*V^T,
              then A^+ = V*inv(Sigma)*U^T is n x m, and is called the
              Moore-Penrose pseudoinverse of A.
             This is the most natural extension of the definition of "inverse"
             to rectangular full-rank matrices. 
             We can also use the SVD, and an appropriately defined Moore-Penrose
             pseudoinverse to solve the rank deficient least squares problems,
             or underdetermined problem (m < n), as we describe later.
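
            (A sketch in matlab; pinv computes the same matrix via the SVD:
               A = randn(8,3);                   % full rank with probability 1
               [U,S,V] = svd(A,0);
               Aplus = V*diag(1./diag(S))*U';    % A^+ = V*inv(Sigma)*U^T
               norm(Aplus - pinv(A))             % ~ macheps
             )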

           Just to solve a least squares problem where you are not
           worried about rank deficiency, the QR decomposition is cheaper.
           On the other hand, we will see that the SVD is more reliable when
           A is nearly singular.

      Fact 3: If A symmetric with eigenvalues Lambda = diag(lambda_1,...,lambda_n)
              and orthonormal eigenvectors V = [v(1),...,v(n)], then its SVD is
               A = V*Lambda*V^T (done if all lambda_i >= 0)
                 = (V*D)*(D*Lambda)*V^T where D = diag(sign(lambda(i))), since D^2 = I
                 = U*Sigma*V^T with U = V*D and Sigma = D*Lambda = |Lambda|
                   (possibly after reordering so the sigma(i) decrease)
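
              (A quick matlab check: the singular values of a symmetric matrix
               are the absolute values of its eigenvalues; a sketch:
                 A = randn(5); A = A + A';              % random symmetric matrix
                 sort(svd(A)) - sort(abs(eig(A)))       % ~ macheps entrywise
               )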

      Fact 4: The eigenvalue decomposition of 
                 A^T*A = (U*Sigma*V^T)^T*(U*Sigma*V^T) = V*Sigma^2*V^T
              (what happens if m>n?)

      Fact 5: The eigenvalue decomposition of 
                 A*A^T = (U*Sigma*V^T)*(U*Sigma*V^T)^T = U*Sigma^2*U^T
              (what happens if m>n?)

      Fact 6: Let H = [ 0 A^T ] be (m+n) x (m+n), where we assume A is m x n
                      [ A  0  ]
               Then H has eigenvalues +- sigma(i), i=1,...,n (plus m-n zero
               eigenvalues if m > n), with eigenvectors 1/sqrt(2)*[v(i) ; +- u(i)]
              Proof: plug in A = U*Sigma*V^T

       Fact 6 suggests that algorithms for the SVD and the symmetric eigenproblem
         will be closely related (see Chap 5)
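
       (A quick matlab check of Fact 6, as a sketch:
          m = 5; n = 3; A = randn(m,n);
          H = [zeros(n) A'; A zeros(m)];   % the (m+n) x (m+n) symmetric matrix above
          sort(abs(eig(H)))'               % the +-sigma(i) in absolute value, plus m-n zeros
          sort(svd(A))'                    % compare with the singular values of A
        )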

      Fact 7: ||A||_2 = sigma(1), and if A is square and nonsingular,
              ||inv(A)||_2 = 1/sigma(n)

         Def: kappa(A) = sigma(1)/sigma(n) is called the condition number of A
              for reasons we will see shortly

      Fact 8: Let S be the unit sphere in R^n. Then A*S is an ellipsoid centered 
              at the origin with principal axes sigma(i)*u(i)
               Proof: A*S = U*Sigma*V^T*S = U*Sigma*S, since V^T maps the unit
                 sphere to itself; and U*Sigma*S = { sum_i sigma(i)*w(i)*u(i) :
                 sum_i w(i)^2 = 1 }, an ellipsoid with principal axes sigma(i)*u(i)
*             (matlab demo a = randn(2,2), svddemo2; a = randn(3,3); svddemo3)

      Fact 9: Suppose sigma(1) >= ... >= sigma(r) > 0 = sigma(r+1) = ... = sigma(n).
              Then A has rank r; the null-space of A is span(v(r+1),...,v(n)), 
              of dimension n-r, and the range space of A is span(u(1),...,u(r)), 
              of dimension r

      Fact 10: Matrix A_k of rank k closest to A in 2-norm is 
                A_k = sum_{i=1 to k} u(i)*sigma(i)*v(i)^T = U*Sigma_k*V^T
               where Sigma_k = diag (sigma(1) , ... , sigma(k), 0, ... 0)
               and the distance is norm(A - A_k , 2) = sigma(k+1)
               Proof: easy to see that A_k has right rank, right distance to A; 
               need to show no closer one:
               Suppose B has rank k, so null space has dimension n-k. 
               The space spanned by {v(1),...,v(k+1)} has dimension k+1. 
               Since the sum of the dimensions (n-k)+(k+1) > n, these two spaces
               must overlap (can't have > n independent vectors in R^n).
                Let h be a unit vector in their intersection; write
                h = sum_{i=1}^{k+1} x(i)*v(i), so V^T*h = [x(1),...,x(k+1),0,...,0]^T.
                Then, since B*h = 0,
                norm(A-B,2) >= norm((A-B)*h,2) = norm(A*h,2)
                             = norm(U*Sigma*V^T*h,2) = norm(Sigma*V^T*h,2)
                             = norm(Sigma*[x(1),...,x(k+1),0,...,0]^T,2)
                            >= sigma(k+1) , since sum_i x(i)^2 = norm(h,2)^2 = 1
     (matlab demo: load clown.mat, [U,S,V]=svd(X); colormap('gray')
     figure(1), clf, image(X),
     figure(2), clf, for k=[1:10,20:10:200], Xk=U(:,1:k)*S(1:k,1:k)*(V(:,1:k))'; ...
       err = norm(X-Xk)/norm(X); compr = k*(200+320+1)/(200*320);
       image(Xk), colormap('gray'), ...
       title(['k= ',int2str(k),' err= ', num2str(err),' compression= ',num2str(compr)]), 
        pause, end )
     (Note: jpeg compression algorithm uses similar idea, on subimages)

Now start using this material to analyze the condition number 
for matrix inversion and solving Ax=b: If A (and b) change
a little bit, how much can inv(A) (and x=inv(A)*b) change?

  If |x|<1, recall that 1/(1-x) = 1 + x + x^2 + x^3 + ...
  Now generalize to matrices:
  Lemma: If norm(X) < 1 in some operator norm, then I-X is nonsingular, and
          inv(I - X) = sum_{i>=0} X^i and
          norm(inv(I-X)) <= 1/(1-norm(X))
            proof: claim I + X + X^2 + ... converges:
               norm(X^i) <= norm(X)^i -> 0 as i increases
               so by equivalence of norms max_{kj} | (X^i)_kj | <= C* norm(X)^i
               and so kj-th entry of I + X + X^2 ... dominated by
               convergent geometric sequence and so converges
                   claim (I - X)*( I + X + X^2 + ... + X^i) 
                            = I - X^(i+1) -> I as i increases
               next norm(I + X + X^2 + ...)
                   <= norm(I) + norm(X) + norm(X^2) + ... by triangle inequality
                   <= norm(I) + norm(X) + norm(X)^2 + ... by ||A*B||<=||A||*||B||
                    =   1     + norm(X) + norm(X)^2 + ... by def of operator norm
                    =   1/(1 - norm(X))               ... geometric sum
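
      (A quick matlab check of the lemma, as a sketch:
         n = 4; X = randn(n); X = 0.5*X/norm(X);    % scale so norm(X,2) = 0.5 < 1
         S = eye(n); T = eye(n);
         for i = 1:60, T = T*X; S = S + T; end      % partial sum I + X + ... + X^60
         norm(S - inv(eye(n)-X))                    % tiny: the series converges to inv(I-X)
         norm(inv(eye(n)-X)) <= 1/(1-norm(X))       % the bound; returns 1 (true)
       )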

   (Later: generalize to arbitrary Taylor expansions, like e^X = sum_i X^i/i!)
              
   Lemma: Suppose A invertible. Then A-E invertible if norm(E) < 1/norm(inv(A)) 
          in which case
          inv(A-E) = Z + Z(EZ) + Z(EZ)^2 + Z(EZ)^3 + ... where Z=inv(A)
          and norm(inv(A-E)) <= norm(Z)/(1-norm(E)*norm(Z))
            proof: inv(A-E) = inv((I-E*Z)*A) = Z*inv(I-E*Z)
               exists if I-E*Z is invertible, which holds if
               norm(E*Z) < 1, which holds if
               norm(E)*norm(Z) < 1 (since norm(E*Z) <= norm(E)*norm(Z)), in which case
               inv(A-E) = Z*(I + EZ + (EZ)^2 + ...)
              take norms 

       What does this say for 1x1 matrices?
       Why can't we write inv(A-E) = Z + EZ^2 + E^2Z^3 + ... ?
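
        (A quick matlab check of the series for inv(A-E), as a sketch:
           A = randn(4); Z = inv(A);
           E = randn(4); E = 0.1*E/(norm(E)*norm(Z));   % ensures norm(E)*norm(Z) = 0.1 < 1
           S = Z; T = Z;
           for i = 1:60, T = T*(E*Z); S = S + T; end    % Z + Z*(EZ) + Z*(EZ)^2 + ...
           norm(S - inv(A-E))                           % tiny
         )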

    Finally, we can ask how much inv(A-E) and inv(A) differ.
   Lemma: Suppose A invertible. If norm(E) < 1/norm(Z), then
          norm(inv(A-E)-Z) <= norm(Z)^2*norm(E)/(1-norm(E)*norm(Z))
           proof: inv(A-E)-Z = Z(EZ) + Z(EZ)^2 + ...
                             = ZEZ ( I + EZ + (EZ)^2 + ...)
                  and take norms

          relative change in inv(A) is
           norm(inv(A-E)-Z)/norm(Z)
          <= [ norm(A)*norm(Z) * 1/(1 - norm(E)*norm(Z)) ] * [ norm(E)/norm(A) ] 
           = [ norm(A)*norm(Z) ] * [ norm(E)/norm(A) ] + O(norm(E)^2)
           = condition_number * relative_change_in_A

       What does this say for 1x1 matrices?
         
    This justifies the following definition:
    Def:   condition number kappa(A) = norm(A)*norm(Z), where Z = inv(A)
   Fact:  kappa(A) >= 1.
       proof: 1 = norm(I) = norm(A*Z) <= norm(A)*norm(Z)

   Theorem: min{norm(E)/norm(A): A+E singular} 
               = relative_distance(A, {singular matrices})
               = 1/kappa(A)
        proof for 2-norm, using SVD: min{norm(E): A+E singular} = sigma_min(A),
              so relative_distance(A,{singular}) = sigma_min(A)/sigma_max(A)
              And norm(Z) = norm(inv(A)) = norm(V*inv(Sigma)*U^T) = norm(inv(Sigma))
                          = 1/sigma_min(A)
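
    (A matlab illustration, as a sketch: subtracting sigma_min(A)*u_min*v_min^T
     gives a nearest singular matrix in the 2-norm:
       A = randn(5); [U,S,V] = svd(A);
       E = -S(end,end)*U(:,end)*V(:,end)';   % norm(E,2) = sigma_min(A)
       min(svd(A+E))                         % ~ 0: A+E is singular
       norm(E)/norm(A) - 1/cond(A)           % ~ 0: relative distance = 1/kappa(A)
     )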

   We've looked at sensitivity of inv(A), now look at solving A*x=b
   Now consider A*x = b vs (A-E)*x' = b+f where x' = x+dx
                subtract to get
                    A*dx - E*x - E*dx = f
                    (A-E)dx = f + E*x
                    dx = inv(A-E)*(f + E*x)
                    norm(dx) = norm(inv(A-E)*(f + E*x))
                            <= norm(inv(A-E))*(norm(f) + norm(E)*norm(x))
                            <= norm(Z)/(1-norm(E)*norm(Z))*(norm(f) + norm(E)*norm(x))
                     norm(dx)/norm(x)
                             <= norm(Z)*norm(A) * 1/(1-norm(E)*norm(Z))
                                * ( norm(f)/(norm(A)*norm(x)) + norm(E)/norm(A) )
                             <= norm(Z)*norm(A) * 1/(1-norm(E)*norm(Z))
                                * ( norm(f)/norm(b) + norm(E)/norm(A) )
                                ... since norm(b) = norm(A*x) <= norm(A)*norm(x)

                   relerr in x <= kappa(A) 
                                  * something close to 1 unless A nearly singular
                                  * (rel change in b + rel change in A)
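
      (A small matlab experiment illustrating this bound, as a sketch:
         n = 50; A = randn(n); x = randn(n,1); b = A*x;
         E = 1e-8*randn(n); f = 1e-8*randn(n,1);     % small perturbations of A and b
         xp = (A-E)\(b+f);                           % perturbed solution x' = x + dx
         relerr = norm(xp-x)/norm(x)
         bound  = cond(A)*(norm(f)/norm(b) + norm(E)/norm(A))  % dropping the 1/(1-...) factor
         relerr <= bound                             % typically returns 1 (true)
       )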

     Our algorithms will attempt to guarantee that 
       (*)   the computed solution x' of A*x=b satisfies (A-E)*x' = b+f where 
            norm(f)/norm(b) = O(macheps) and norm(E)/norm(A) = O(macheps)
         so relerr in x ~ O( kappa(A)*macheps )
        Recall that property (*) is called "backward stability" 

       Another practical approach: given x', how accurate a solution is it?
       compute residual r = A*x'-b = A*x'-A*x = A*(x'-x) = A*error
       so error = inv(A)*r and norm(error) <= norm(inv(A))*norm(r)
       norm(r) also determines the backward error in A
    Theorem: The smallest E such that (A+E)x' = b has norm(E) = norm(r)/norm(x')
      proof: r = A*x' - b = -E*x' so norm(r) <= norm(E)*norm(x')
      to attain lower bound on norm(E), choose E = -r*x'^T/norm(x')^2, in 2 norm

   In other words, if Ax'-b is small, then the backwards error is small,
   which is probably the most you should ask for when the entries of A are uncertain.
   (Later we will see how to use "iterative refinement", aka Newton's method,
    to get a tiny error in x as long as kappa(A) is not about 1/macheps or larger,
    but this is only sometimes justified.)
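
    (A quick matlab check of the theorem above, as a sketch:
       n = 20; A = randn(n); b = randn(n,1);
       xp = A\b + 1e-6*randn(n,1);        % some approximate solution x'
       r = A*xp - b;
       E = -r*xp'/norm(xp)^2;             % the smallest backward perturbation in the 2-norm
       norm((A+E)*xp - b)                 % ~ macheps: (A+E)*x' = b exactly
       norm(E) - norm(r)/norm(xp)         % ~ 0: norm(E,2) = norm(r)/norm(x')
     )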

   All our practical error bounds depend on norm(inv(A)). To actually compute inv(A)
   costs several times as much as just solving A*x=b (2n^3 versus (2/3)n^3), 
   so we will use cheap estimates of norm(inv(A)) that avoid computing 
   inv(A) explicitly, and work with small probability of large error.

   The idea is to use the definition
      || inv(A) || =  max_{||x|| = 1} || inv(A)*x || =  max_{||x||<= 1} || inv(A)*x ||
     and do gradient ascent ("go uphill") on  || inv(A)*x ||
     on the convex set ||x|| <= 1; one may also start with a random starting vector.
      For the right choice of norm it is easy to figure out the ascent direction,
      and each step requires solving A*x=b for some b, which only costs O(n^2)
      (assuming we have already done the LU factorization).
      In practice it usually takes at most 5 steps or so to stop ascending, 
      so the total cost is O(n^2); see the text for details.
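
    (A minimal matlab sketch of such an estimator, in the spirit of Hager's 1-norm
     estimator that underlies condest and LAPACK's xGECON; the details here are
     just for illustration, not the actual library code:
       n = 100; A = randn(n);
       [L,U,P] = lu(A);                  % O(n^3) factorization, done once
       x = ones(n,1)/n;                  % start on the unit 1-norm ball
       for iter = 1:5                    % usually stops ascending within a few steps
         y  = U\(L\(P*x));               % y = inv(A)*x; each solve costs O(n^2)
         xi = sign(y); xi(xi==0) = 1;
         z  = P'*(L'\(U'\xi));           % z = inv(A)'*xi, the ascent direction
         [zmax,j] = max(abs(z));
         if zmax <= z'*x, break, end     % no ascent direction left: stop
         x = zeros(n,1); x(j) = 1;       % move to the best vertex e_j of the 1-norm ball
       end
       est = norm(y,1)                   % lower-bound estimate of norm(inv(A),1)
       norm(inv(A),1)                    % compare (computing inv(A) is what we avoid)
     )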

   In fact there is a theorem  (Demmel, Diament, Malajovich, 2000) that says that
   estimating kappa(A) even roughly, but still with some guarantee (say "to within
    a factor of 10", or within any constant factor) is about as expensive as 
   multiplying matrices, which in turn is about as expensive as doing Gaussian
   elimination in the first place.
   Since our goal is to spend just an additional O(n^2) to estimate norm(inv(A)),
   given that we have already done Gaussian Elimination (LU factorization),
    this theorem implies that we need to settle for a small probability of 
   getting a large error in our estimate of norm(inv(A)).

  Where to find implementations of all this?
     Matlab: A\b  or [L,U,P]=lu(A) or condest or normest1
    LAPACK: xGETRF just for GEPP where x = S/D/C/Z
            xGESV to solve A*x=b
            xGESVX for condition estimation, more
            xGECON for condition estimation alone
    CLAPACK: similar
    ScaLAPACK: PxGESV, etc