Notes for Math 221, Lecture 4, Sep 8 2009

Goal: Gaussian Elimination

   Overall plan:
      Derive basic algorithm (twice)
      Need for Pivoting to keep backward error small
      Error Analysis and practical error bounds
      How to implement matmul and then GE efficiently
      Important variations exploiting structure (spd, banded, sparse,...)
      Iterative refinement to improve an approximate answer

   Def Permutation matrix: identity with permuted rows
   Facts: let P, P1 etc be permutation matrices
      P*X = row permutation of X
      X*P = col permutation of X
      P1*P2=P3
      inv(P) = P^T (enough to check that diag(P*P^T) = all ones)
      det(P) = +-1
      P is an orthogonal matrix
      to store and multiply by P, just keep track of order (cheap)
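
   These facts are easy to check in Matlab; a minimal sketch (the size and
   names here are just for illustration):

      n = 5;
      p = randperm(n);           % represent P by an index vector p
      P = eye(n);  P = P(p,:);   % explicit permutation matrix: row-permuted identity
      X = rand(n);
      norm(P*X - X(p,:))         % P*X permutes rows of X: should be 0
      norm(X*P' - X(:,p))        % X*P' permutes columns of X: should be 0
      norm(P*P' - eye(n))        % inv(P) = P^T: should be 0
      abs(det(P))                % det(P) = +-1: should be 1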

   Algorithm: GE with pivoting
      1) Factor A=P*L*U where P is a perm,  L unit lower triangular, U upper triangular
         (cost = (2/3)n^3 +O(n^2) flops, the hard part)
      2) Solve P*L*U*x=b for L*U*x = P^T*b (permute) (no flops)
      3) Solve L*U*x=P^T*b for U*x = inv(L)*P^T*b (substitution) (cost = n^2 flops)
      4) Solve U*x=inv(L)*P^T*b for x = inv(U)*inv(L)*P^T*b (substitution) (cost = n^2 flops)

  Note: We do not compute inv(A) and multiply by it! 
        This both takes 3x as many flops for a dense matrix and is less accurate
  Note: once we have A = P*L*U, handling more right-hand-sides b is cheap, O(n^2)
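
  In Matlab the whole solve looks like this (a sketch; lu's 'vector' option
  returns the permutation as an index vector, so the "no flops" permutation
  in step 2 is just a subscript):

      A = rand(100);  b = rand(100,1);
      [L,U,p] = lu(A,'vector');   % step 1: A(p,:) = L*U, (2/3)n^3 flops
      y = L \ b(p);               % steps 2-3: permute b, forward substitution, n^2 flops
      x = U \ y;                  % step 4: back substitution, n^2 flops
      norm(A*x - b)/(norm(A)*norm(x))   % residual: should be O(macheps)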
  Will derive the algorithm for 1) in a couple of ways, and show they are equivalent
  First we will deal with the case where we don't need P (or imagine we start
  with P^T*A, and compute P^T*A = L*U)

  Approach 1: Start with how you may have learned it long ago:
     "Take each row of the matrix, and add multiples of it to later rows
      to zero out each column below the diagonal, until you get a triangular matrix U"

     for i = 1:n-1   ... for each column i, zero it out below diagonal
       for j = i+1:n  ... for each row j below i
            m = A(j,i)/A(i,i) ... subtract a multiple m of row i from row j
             for k = i:n ... for each possibly nonzero entry of row j
                A(j,k) = A(j,k) - m*A(i,k)  

     Let's "optimize" this:
       (1) don't bother computing zero entries:  
                line 4 becomes for k = i+1:n, so A(j,i) is left alone
       (2) save multipliers m in these locations that would have been zeroed out:
             line 3 becomes A(j,i) = A(j,i)/A(i,i)
             line 5 becomes A(j,k) = A(j,k) - A(j,i)*A(i,k)
       (3) split into two loops: compute all the A(j,i) first, then use them:

      for i = 1:n   ... for each column i, zero it out below diagonal
        for j = i+1:n  ... for each row j below i
             A(j,i) = A(j,i)/A(i,i) ... compute and save the multiplier m in A(j,i)
        for j = i+1:n  ... for each row j below i
             for k = i+1:n ... for each entry of row j past column i
                 A(j,k) = A(j,k) - A(j,i)*A(i,k)  

       (4) use Matlab notation for loops

      for i = 1:n
        A(i+1:n, i) = A(i+1:n, i)/A(i,i)  ... scale column i by (1/A(i,i))
        A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n) 
            ... subtract a rank-1 matrix from the trailing (n-i) x (n-i) submatrix

    (draw picture)

    After running this algorithm, A has been overwritten by new entries.
    Let upper triangular part of result be called U
    Let strict lower triangular part of result, containing saved multipliers,
    be called M, so L = I+M is unit lower triangular
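
     As a sanity check, here is a direct Matlab transcription (no pivoting,
     so it assumes no zero pivot is ever encountered; the function name genp
     is made up for this sketch):

      function [L,U] = genp(A)
      % GENP  LU factorization by Gaussian elimination with NO pivoting.
      n = size(A,1);
      for i = 1:n-1
          A(i+1:n,i) = A(i+1:n,i)/A(i,i);                          % multipliers, stored in place
          A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n); % rank-1 update
      end
      L = eye(n) + tril(A,-1);   % unit lower triangular: I + M
      U = triu(A);               % upper triangular part
      end

      % e.g.  A = rand(4); [L,U] = genp(A); norm(A - L*U)   % should be O(macheps)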

    Thm: Assuming we don't divide by 0, A = L*U
    Proof: derive same algorithm differently

  Def: Leading Principal Submatrix of A is any A(1:i,1:i)

  Thm: A=LU with L unit lower triangular and U upper triangular and nonsingular
       exists and is unique iff all Leading Principal Submatrices (LPSs) of A nonsingular
  Corollary: If A nonsingular there exist permutations P1 and P2 such that
             P1*A*P2 = L*U (only need one of P1 or P2, not both)
             where L is unit lower triangular and U is nonsingular upper triangular

  Standard algorithm as an embodiment of the proof of one direction (A=LU exists)

  Notation for block matrix multiplication, to be used frequently:
     A * B = C can be written [ A11 A12 ] * [ B11 B12 ] = [ C11 C12 ]
                              [ A21 A22 ]   [ B21 B22 ]   [ C21 C22 ]
      where Cij = Ai1*B1j + Ai2*B2j, true as long as the dimensions are compatible:
      #rows(Cij) = #rows(Aik) for all i,j,k
      #cols(Cij) = #cols(Bkj) for all i,j,k
      #cols(Aik) = #rows(Bkj) for all i,j,k

  Write A = [ A11 A12 ] = [ 1   0 ] * [ U11  U12 ] = [ U11      U12           ]
            [ A21 A22 ]   [ L21 I ]   [  0   U22 ]   [ L21*U11  L21*U12 + U22 ]
  where A11 and U11 1 x 1, A11 nonzero by assumption, 
        A21 and L21 (n-1) x 1
        A12 and U12 1 x (n-1)
        A22 and U22 (n-1) x (n-1)
  For this "2x2 block triangular factorization" to hold we need to set
         U11 = A11       U12 = A12
          A21 = L21*U11   A22 = L21*U12 + U22
  or
         L21 = A21/A11   U22 = A22 - L21*U12
  Notation: A11 called "pivot", U22 called "Schur complement"

  Want to use induction on U22, so need to show
  that all LPSs of U22 are nonsingular (draw picture)
  Write above factorization as A = Lhat*Uhat, so
     0 neq det(A(1:i,1:i)) = det(Lhat(1:i,1:i)*Uhat(1:i,1:i))
                           = det(Lhat(1:i,1:i))*det(Uhat(1:i,1:i))
                            =      1            * U11 * det(U22(1:i-1,1:i-1))
     so det(U22(1:i-1,1:i-1)) is nonzero as desired

  Then by induction U22 = L' * U' and we get
  A = [ A11 A12 ] = [ 1   0 ] * [ U11  U12   ] = [ 1   0  ] * [ U11 U12 ]
      [ A21 A22 ]   [ L21 I ]   [  0   L'*U' ]   [ L21 L' ]   [  0  U'  ]
    = unit lower triangular * upper triangular as desired

  What happens if we write this as an algorithm? We get the same as before:
     for i=1:n  ... for each step of induction
         L21        = A21        / A11    ...  compute L21, i.e.
         A(i+1:n,i) = A(i+1:n,i) / A(i,i) ...  overwrite L on A as before
         U22            = A22            - L21        * U12   ... compute U22, i.e.
         A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)  ...  A turns into U
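
  Equivalently, the induction transcribes into a recursive Matlab sketch
  (again no pivoting; a teaching version, not an efficient one):

      function [L,U] = blocklu(A)
      % BLOCKLU  Recursive LU via the 2x2 block factorization.
      n = size(A,1);
      if n == 1, L = 1; U = A; return, end
      L21 = A(2:n,1)/A(1,1);               % L21 = A21/A11
      S   = A(2:n,2:n) - L21*A(1,2:n);     % Schur complement U22 = A22 - L21*U12
      [Lp,Up] = blocklu(S);                % induction: U22 = L'*U'
      L = [1, zeros(1,n-1); L21, Lp];
      U = [A(1,1), A(1,2:n); zeros(n-1,1), Up];
      end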

  What about uniqueness?  L1*U1 = L2*U2 implies
   inv(L2)*L1 = U2*inv(U1), i.e.
   unit lower triangular = upper triangular, so both equal I, and L1 = L2, U1 = U2

  What about other direction, all LPS nonsingular? Get LU decomp of each submatrix,
     and det(L(1:i,1:i))*det(U(1:i,1:i)) = det(A(1:i,1:i))
     or    nonzero      * nonzero        = nonzero
 
  Next: need permutations when A just nonsingular

  Induction again: If A nonsingular, it has a nonzero somewhere in first column (resp row)
     so permute rows (resp columns) so that (1,1) entry of P1*A  (resp. A*P2) is nonzero
     (only one of P1 and P2 necessary)
     Do 2x2 block triangular factorization as before
        P1*A*P2 = [ A11 A12 ] = [ 1   0 ] * [ U11 U12  ]  
                  [ A21 A22 ]   [ L21 I ]   [  0  U22  ]   
        so U11 = A11, U12 = A12, L21 = A21/A11, U22 = A22-L21*U12
        Apply induction to U22, since  det(A) = +- U11*det(U22) nonzero, so
           P1'*U22*P2' = L'*U' or
        P1*A*P2 = [ 1   0 ] * [ U11 U12               ]
                  [ L21 I ]   [  0  P1'^T*L'*U'*P2'^T ]
                = [ 1   0        ] * [ U11 U12      ]
                  [ L21 P1'^T*L' ]   [  0  U'*P2'^T ]
                = [ 1   0   ] * [ 1       0  ] * [ U11 U12      ]
                  [ 0 P1'^T ]   [ P1'*L21 L' ]   [  0  U'*P2'^T ]
                = [ 1   0   ] * [ 1       0  ] * [ U11 U12*P2' ] * [ 1  0     ]
                  [ 0 P1'^T ]   [ P1'*L21 L' ]   [  0  U'      ]   [ 0  P2'^T ]
        or
        [ 1  0   ] * P1 * A * P2 * [ 1  0  ] = [ 1       0 ] * [ U11 U12*P2' ]
        [ 0  P1' ]                 [ 0 P2' ]   [ P1'*L21 L']   [  0  U'      ]
        or
        (perm * perm) * A * (perm * perm) = unit lower triangular * upper triangular
        as desired.
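
      A recursive sketch with row pivoting only (P2 = I), in the same style as
      blocklu above (the function name blocklupp is made up; it returns the
      permutation as an index vector p with A(p,:) = L*U):

      function [L,U,p] = blocklupp(A)
      % Recursive LU with partial (row) pivoting: A(p,:) = L*U.
      n = size(A,1);
      if n == 1, L = 1; U = A; p = 1; return, end
      [~,j] = max(abs(A(:,1)));            % largest entry in first column ...
      A([1 j],:) = A([j 1],:);             % ... swapped to the (1,1) position
      L21 = A(2:n,1)/A(1,1);
      S   = A(2:n,2:n) - L21*A(1,2:n);     % Schur complement
      [Lp,Up,pp] = blocklupp(S);           % induction: S(pp,:) = L'*U'
      L = [1, zeros(1,n-1); L21(pp), Lp];  % L21 rows permuted the same way
      U = [A(1,1), A(1,2:n); zeros(n-1,1), Up];
      q = 1:n;  q([1 j]) = q([j 1]);       % the row swap at this level
      p = q([1, 1+pp(:)']);                % composed with the recursive permutation
      end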

  Why bother with P1 and P2? 
    (1) for dense matrices, in practice, just use P1 as described later
    (2) some rare cases where errors much smaller by picking (1,1) entry
        from somewhere other than first column, to make it bigger
    (3) for sparse matrices, need both P1 and P2 to keep L and U sparse
        and save a lot of work


  We need to do pivoting, even if we wouldn't divide by zero:
  Suppose we run in single precision, 7 decimal digits
      A = [ 1e-8   1 ],  inv(A) ~ [ -1   1  ] , 
          [    1   1 ]            [  1 -1e-8]
      cond(A) ~ 2.6, really small, so should get accurate answers
      L = [    1   0 ]   U = [ 1e-8  1                 ]
          [  1e8   1 ]       [  0  fl(1-1e8*1) = -1e8  ]
    but L*U = [  1e-8   1 ]
              [   1     0 ]  
    (2,2) wrong, same answer for A(2,2)=-1, 0, +1...  
    Answer L,U  nearly independent of A(2,2):  "numerical instability"
    L, U not the exact factors of a nearby problem
    Ex: Solve A*x = [1;2]; start by solving L*y = [1;2]
    so y(1) = fl(1/1) = 1
       y(2) = fl(2 - 1e8*1) = -1e8, (nearly) independently of b(2),
       so would get same solution for many different b(2): can't all have small error!
     Another way to see why this is bad: cond(L) = cond(U) ~ 1e16 (each has
     norm ~1e8 and determinant +-1), much bigger than cond(A) ~ 2.6

    If we permute so A(1,1)=1, then full accuracy.
    Intuition: we want to pick a large entry of A to be pivot A11
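
     This is easy to reproduce in Matlab single precision (macheps ~ 6e-8;
     genp is the no-pivoting sketch above):

      A = single([1e-8 1; 1 1]);
      [L,U] = genp(A)            % L ~ [1 0; 1e8 1], U ~ [1e-8 1; 0 -1e8]
      L*U                        % = [1e-8 1; 1 0]: A(2,2) has been lost
      b = single([1; 2]);
      x = U \ (L \ b)            % = [0; 1], but the true solution is ~[1; 1]
      xgood = A \ b              % GEPP (pivoted): accurate
      norm(abs(L)*abs(U))/norm(A)   % pivot growth ~ 1e8 explains the failure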

  Error Analysis
     We want backward stability, i.e. P*L*U = A+E with E "small", i.e. norm(E) = O(macheps)*norm(A)
     We will see that we need to avoid values of L(j,i)*U(i,k) that are
      much larger than the original entries of A.
     Looking at the algorithm, each entry of U and L is computed
     essentially as a dot product, which we analyzed in Q 1.10:

    track where U(j,k) comes from: (picture): for j <= k, we compute
      fl[ A(j,k) - L(j,1)*U(1,k) - L(j,2)*U(2,k) - ... - L(j,j-1)*U(j-1,k) ]
       = U(j,k)

     apply analysis of dot product (Q1.10) to get
     U(j,k) = (1+e)*[ A(j,k) - fl(sum_{i=1:j-1} L(j,i)*U(i,k)) ]
              where |e| <= macheps
            = (1+e)*[ A(j,k) - sum_{i=1:j-1} L(j,i)*U(i,k)*(1+d(i)) ]
          where each |d(i)| <= (j-1)*macheps
     Rewrite this as A(j,k) = exact dot product + (small) error:
       A(j,k) = U(j,k)/(1+e) + sum_{i=1:j-1} L(j,i)*U(i,k)*(1+d(i))
              ~ U(j,k)*(1-e) + sum_{i=1:j-1} L(j,i)*U(i,k)*(1+d(i))
              = U(j,k) + sum_{i=1:j-1} L(j,i)*U(i,k)
                -e*U(j,k) + sum_{i=1:j-1} L(j,i)*U(i,k)*d(i)
              = (L*U)(j,k) + E(j,k)
       where
       |E(j,k)| = |-e*U(j,k) + sum_{i=1:j-1} L(j,i)*U(i,k)*d(i)|
                <= [ |U(j,k)| + sum_{i=1:j-1} |L(j,i)*U(i,k)| ]*(j-1)*macheps
                <= (|L|*|U|)(j,k) * n*macheps   ... using L(j,j) = 1 and j-1 <= n

   Similar analysis applies when j > k to get
       L(j,k) = fl[ ( A(j,k) - L(j,1)*U(1,k) - ... - L(j,k-1)*U(k-1,k) )/U(k,k) ]
              = (1+e1)*(1+e2)*( A(j,k) - fl(sum_{i=1:k-1} L(j,i)*U(i,k)) )/U(k,k)
      solve for A(j,k) to get
      A(j,k) = L(j,k)*U(k,k)/((1+e1)*(1+e2)) + fl(sum_{i=1:k-1} L(j,i)*U(i,k) )
             = ...
             = (L*U)(j,k) + E(j,k)
      where
       |E(j,k)| <= (|L|*|U|)(j,k) * n*macheps  again
   
    Altogether:  A = L*U + E where |E| <= |L|*|U| * n*macheps componentwise
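
     One can check this componentwise bound numerically (a sketch, again
     using the made-up genp from above; on random matrices the ratio is well
     below 1):

      n = 50;  A = rand(n);
      [L,U] = genp(A);
      E = A - L*U;
      max(max( abs(E) ./ (n*eps*abs(L)*abs(U)) ))   % should be < 1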

    Putting together whole solution (from Q1.11), since there are also
    error from triangular solves:

    Solving Ly=b => (L+dL)yhat = b     where |dL| <= n*macheps*|L|
    Solving Ux=y => (U+dU)xhat = yhat  where |dU| <= n*macheps*|U|
    
    Combine b = (L+dL)*yhat 
               = (L+dL)*(U+dU) * xhat 
              = (L*U + L*dU + dL*U + dL*dU) * xhat
              = (A - E + L*dU + dL*U + dL*dU ) * xhat
              = (A + dA)*xhat
    where |dA| <= |E| + |L*dU| + |dL*U| + |dL*dU|
               <= |E| + |L|*|dU| + |dL|*|U| + |dL|*|dU|
                <= (3*n*macheps + O(macheps^2)) * |L|*|U| 
                ~ 3*n*macheps * |L|*|U|

    Is this "backward stable"? I.e. is norm(dA) = O(macheps)*norm(A)?
    Need to compare norm(|L|*|U|) with norm(A)

    Def: "Pivot growth factor" = g = norm(|L|*|U|)/norm(A)
    Fact:  g >= 1 (for most norms)

    Then ||dA|| <= 3*n*macheps*g* ||A||
    so we can compare xhat from (A+dA)*xhat=b with true solution x from A*x=b:
    Thm: || xhat - x ||/||x|| <= 3*n*macheps*g*kappa(A) + O(macheps^2)
    Proof: from before, we bound
         || xhat - x ||/||x|| <= kappa(A) * ||dA||/||A||

    Whether this is satisfactory depends on g:

    Bad example: A = [1e-8 1] from above:
                     [ 1   1]
     |L|*|U| = [    1   0 ] * [ 1e-8  1    ] = [ 1e-8   1 ]
               [  1e8   1 ]   [  0  |-1e8| ]   [ 1    2e8 ]
      and the largest entry 2e8 is much bigger than any entry of A (g ~ 1e8):
      this is why we got the wrong answer

    Idea: Pick pivot A11 "large" so when we divide by it, entries of L are small:

    (1) Simplest, and standard, approach, used in most libraries: "Partial pivoting" (GEPP)
        Permute rows only so A11 largest available entry in column
        Then L21 = A21/A11 has |L21| <= 1
        Theorem (easy): with GEPP, |L(j,k)| <= 1 and the pivot growth satisfies gPP <= 2^(n-1)
        Bad news: the worst case is terrible; even for n=24 in single precision, all digits wrong
        Good news: it hardly ever happens (only a very small family of matrices where this
                   occurs)
        Empirical observation, with some justification: gPP < n^(2/3)
            (see the experiment sketched after this list)
            If all entries of the matrix were "random", this would be true;
            as you perform pivoting, they seem to get more random
     (2) Complete pivoting (GECP): permute rows and columns so that A11 is the largest entry in the whole matrix
        Theorem:     gCP < n^(log n /4)
        Empirical:  gCP < n^(1/2)
        Long-standing Conjecture: gCP < n (false, but nearly true)
        More expensive, hardly used, not in most libraries
    (3) Communication-avoiding pivoting - something new, introduce it when we
        talk about algorithms that minimize communication
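
     A quick experiment in this direction (a sketch; it prints the norm-based
     g defined above and the classical max-entry growth max|U|/max|A|):

      for n = [100 200 400 800]
          A = randn(n);
          [L,U,p] = lu(A,'vector');                 % GEPP
          g    = norm(abs(L)*abs(U))/norm(A);       % g as defined above
          gmax = max(abs(U(:)))/max(abs(A(:)));     % max-entry growth
          fprintf('n = %4d  g = %9.1f  gmax = %6.2f  n^(2/3) = %6.2f\n', ...
                  n, g, gmax, n^(2/3))
      end
      % gmax stays near n^(2/3), vastly below the worst case 2^(n-1)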

    In practice, what if GEPP not good enough? Use QR instead (later)

     Finally, we note that 3*n*macheps*g*norm(A) is an upper bound, unlikely to
     be attained; see figures 2.1 and 2.2 for plots of the actual backward error
     (which we know is ||E||/||A|| = ||A*xhat - b||/(||A||*||xhat||)) on random
     matrices; it is nearly always about macheps, i.e. it doesn't grow with n.
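
     To reproduce the observation behind those figures (a sketch):

      for n = [100 200 400 800]
          A = randn(n);  b = randn(n,1);
          xhat = A \ b;                              % GEPP via backslash
          berr = norm(A*xhat - b)/(norm(A)*norm(xhat));
          fprintf('n = %4d   backward error = %.1e   macheps = %.1e\n', n, berr, eps)
      end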

    Estimate error bound:
     need to estimate norm(inv(A)), given L and U
     goals: cheap, accurate
     can't have both (Theorem of D., Diament, Malajovich), settle for cheap
     idea: use || inv(A) || =  max_{||x|| = 1} || inv(A)*x ||
                            =  max_{||x||<= 1} || inv(A)*x ||
     and do gradient ascent ("go uphill") on  || inv(A)*x ||
     on convex set ||x|| <= 1; 
      for the right choice of norm (the 1-norm) it is easy to pick the ascent direction;
      each step requires solving A*x=b or A^T*x=b for some b, which costs only O(n^2)
      given the LU factors; usually at most 5 steps or so are needed before the ascent
      stops, so the total cost is O(n^2); see the text for details
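
      The estimator described is Hager's; a hedged Matlab sketch of the idea,
      given the GEPP factors A(p,:) = L*U (the function name invnormest1 is
      made up; Matlab's condest implements a refined version):

      function est = invnormest1(L,U,p)
      % Estimate ||inv(A)||_1 by ascent on ||inv(A)*x||_1 over ||x||_1 <= 1.
      % Each step: one solve with A and one with A', each O(n^2).
      n = size(L,1);
      x = ones(n,1)/n;                  % start inside the unit 1-norm ball
      est = 0;
      for k = 1:5                       % usually stops ascending within ~5 steps
          y = U \ (L \ x(p));           % y = inv(A)*x
          est = norm(y,1);              % current estimate of ||inv(A)||_1
          xi = sign(y); xi(xi==0) = 1;  % subgradient of ||.||_1 at y
          w = L' \ (U' \ xi);           % solve A'*z = xi ...
          z = zeros(n,1); z(p) = w;     % ... accounting for the permutation
          [zmax,j] = max(abs(z));
          if zmax <= z'*x, break, end   % no ascent direction left: done
          x = zeros(n,1); x(j) = 1;     % "go uphill" to the vertex e_j
      end
      end

      % e.g.  [L,U,p] = lu(A,'vector');  kappa1 = norm(A,1)*invnormest1(L,U,p)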

  Where to find all this?
    Matlab: A\b  or [P,L,U]=lu(A) or condest
    LAPACK: xGETRF just for GEPP where x = S/D/C/Z
            xGESV to solve A*x=b
            xGESVX for condition estimation, more
            xGECON for condition estimation alone
    CLAPACK: similar
    ScaLAPACK: PxGESV, etc