Notes for Ma221 Lecture 26, Dec 1 2009

Continue Chapter 6 on iterative methods

Goals (for the rest of the semester)

    Recall CG
    How fast does CG converge? 
    How to make CG (or other Krylov subspace methods) converge faster 
       by preconditioning
    Solving the 2D Poisson equation with the FFT, in O(N log N) time
    Solving the 2D Poisson equation with Multigrid, in O(N) time

Recall CG

Conjugate Gradient Method for solving Ax=b:
Choose x_k to minimize ||b-A*x||_{A^(-1)} over all x in span{b,A*b,...,A^(k-1)*b}
   k= 0 , x_0 = 0, r_0 = b, p_1 = b
   repeat
     k = k+1
     z = A*p_k  ... get vector in larger Krylov subspace
     nu_k = r_(k-1)^T*r_(k-1) / p_k^T*z
     x_k  = x_(k-1) + p_k*nu_k ... update approximate solution
     r_k  = r_(k-1) - nu_k*z  ... update residuals
     mu_(k+1) = r_k^T*r_k / r_(k-1)^T*r_(k-1)
     p_(k+1)  = r_k + mu_(k+1) * p_k   ... update gradients (search directions)
   until || r_k ||_2 small enough
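
   As a concrete reference, here is a minimal NumPy sketch of the iteration
   above (the function name cg and the arguments tol/maxiter are mine, not
   part of the notes):

      import numpy as np

      def cg(A, b, tol=1e-10, maxiter=None):
          """Solve A*x = b for symmetric positive definite A by conjugate gradients."""
          n = len(b)
          maxiter = maxiter or 10 * n
          x = np.zeros(n)                      # x_0 = 0
          r = b.copy()                         # r_0 = b
          p = b.copy()                         # p_1 = b
          rTr = r @ r                          # r_0^T*r_0
          for k in range(1, maxiter + 1):
              z = A @ p                        # get vector in larger Krylov subspace
              nu = rTr / (p @ z)
              x = x + nu * p                   # update approximate solution
              r = r - nu * z                   # update residual
              rTr_new = r @ r
              if np.sqrt(rTr_new) <= tol * np.linalg.norm(b):
                  return x, k                  # || r_k ||_2 small enough
              mu = rTr_new / rTr
              p = r + mu * p                   # update search direction
              rTr = rTr_new
          return x, maxiter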

How fast does CG converge? 
      #iterations grows like sqrt(condition number)
         example of a bound that we can prove (worst case):
         Thm: error(k) <= error(0) * 2/(1 + 2/sqrt(cond-1))^k  (approximately, when cond is large)
           Proof sketch: since x_k = "best" solution in {b,A*b,...,A^(k-1)*b}
             we can write x_k = q_(k-1)(A)*b where q_(k-1)(x) is a degree k-1
             polynomial in x, and write the error as
               || x_k - x ||_A^2 = (q_(k-1)(A)*b - x)^T*A*(q_(k-1)(A)*b-x)
                                 = (q_(k-1)(A)*A*x - x)^T*A*(q_(k-1)(A)*A*x-x)
                   = x^T * (q_(k-1)(A)*A - I)^T*A*(q_(k-1)(A)*A-I) * x 
                   = x^T * A * (q_(k-1)(A)*A - I)^2 * x 
             Writing A = Q*Lambda*Q^T simplifies this to
                   = x^T*Q * Lambda * (q_(k-1)(Lambda)*Lambda - I)^2 * Q^T*x 
                   <= max_i (q_(k-1)(lambda_i)*lambda_i - 1)^2 * x^T*Q*Lambda*Q^T*x
                    = max_i (q_(k-1)(lambda_i)*lambda_i - 1)^2 * || x ||_A^2
             So the question becomes which degree k-1 polynomial q_(k-1)(x)
             minimizes |q_(k-1)(x)*x-1| over x = all eigenvalues of A,
             a classical question about polynomial approximations.
             The Chebyshev polynomial can be used to get an upper bound on
             how big this can be.

      This is a common property of all our iterative methods:
         high condition number => slow convergence
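
      To see this numerically, one can run the cg sketch above on the 2D model
      problem for a few grid sizes N (this experiment is my illustration; it
      uses scipy.sparse to build the N^2-by-N^2 Poisson matrix):

         import numpy as np
         import scipy.sparse as sp

         for N in [15, 31, 63]:
             T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N))
             A = (sp.kron(sp.eye(N), T) + sp.kron(T, sp.eye(N))).tocsr()  # 2D Poisson
             b = np.ones(N * N)
             _, iters = cg(A, b, tol=1e-8)
             # cond(A) = lambda_max(T)/lambda_min(T) for the model problem
             cond = (2 - 2*np.cos(N*np.pi/(N+1))) / (2 - 2*np.cos(np.pi/(N+1)))
             print(f"N={N:3d}  sqrt(cond)={np.sqrt(cond):6.1f}  CG iterations={iters}")

      Doubling N roughly doubles sqrt(cond), and the iteration counts grow
      roughly in proportion.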

How to make CG converge faster by preconditioning
   Ex: A = [ 1 1e-10 ; 1e-10 1e-18] has condition number ~1e18
       However choosing D = diag(1, 1e9), and setting 
          A' = D*A*D = [ 1 1e-1 ; 1e-1 1 ]
        gives us cond(A') ~ 1. So instead of solving A*x=b we could
       instead solve A'*x' = b' where x'=inv(D)*x and b' = D*b,
       which is much better conditioned, and then let x = D*x'.
       We can also think of this as choosing M = diag(A) and
        setting D = M^(-1/2)
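
       A quick numerical check of this 2x2 example (my illustration; the
       computed cond(A) may be perturbed by roundoff, since the tiny
       singular value of A is below machine epsilon times ||A||):

          import numpy as np

          A = np.array([[1.0,   1e-10],
                        [1e-10, 1e-18]])
          D = np.diag([1.0, 1e9])      # D = M^(-1/2) with M = diag(A)
          Ap = D @ A @ D               # A' = D*A*D = [[1, 0.1], [0.1, 1]]
          print(np.linalg.cond(Ap))    # ~1.2
          print(np.linalg.cond(A))     # ~1e18 in exact arithmetic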
       
    More generally: Choose some spd approximation M to A (not necessarily
      diagonal!) such that A' = M^(-1/2)*A*M^(-1/2) is much better conditioned
      than A, and run CG on A'. Here we mean that if M = Q*Lambda*Q^T
      is spd, we take M^(1/2) = Q*Lambda^(1/2)*Q^T, which is also spd,
      and note that A' is spd, so that we are allowed to apply CG to it.

    To make this practical, we prefer to avoid needing M^(-1/2), and instead
    reorganize CG to multiply by M^(-1) and A. In particular, we think of 
    solving M^(-1)*A*x = M^(-1)*b, and note that M^(-1)*A and
        M^(1/2)*(M^(-1)*A)*M^(-1/2) = M^(-1/2)*A*M^(-1/2)
    are similar, and so have the same eigenvalues.
    However, M^(-1)*A is not symmetric in general, so we can't just run CG
    with A replaced by M^(-1)*A and b by M^(-1)*b.
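
    A small sanity check of this similarity (my illustration, building
    M^(-1/2) from the eigendecomposition M = Q*Lambda*Q^T as above):

       import numpy as np

       rng = np.random.default_rng(1)
       n = 5
       B = rng.standard_normal((n, n)); A = B @ B.T + n * np.eye(n)   # an spd A
       C = rng.standard_normal((n, n)); M = C @ C.T + n * np.eye(n)   # an spd M
       w, Q = np.linalg.eigh(M)
       Mhi = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T                      # M^(-1/2)
       e1 = np.sort(np.linalg.eigvals(np.linalg.solve(M, A)).real)    # eig(M^(-1)*A)
       e2 = np.sort(np.linalg.eigvalsh(Mhi @ A @ Mhi))                # eig(M^(-1/2)*A*M^(-1/2))
       print(np.allclose(e1, e2))                                     # True (up to roundoff)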

    Procedure: write down CG for A'*x'=b', 
         with A' = M^(-1/2)*A*M^(-1/2)
              x' = M^(1/2)*x
              b' = M^(-1/2)*b
              r' = b'-A'*x' = M^(-1/2)*(b-A*x) = M^(-1/2)*r
    and define new variables x, b and r
    by "absorbing" M^(1/2) and M^(-1/2} factors into variables with primes
    (like x') so that we multiplications by A and M^(-1) remain.

Preconditioned Conjugate Gradient Method for solving Ax=b:
   ... write Mi = M^(-1), Mh = M^(1/2) and Mhi= M^(-1/2) for short
   k= 0 , x'_0 = 0, r'_0 = b' = Mhi*b, p'_1 = b' = Mhi*b
   ... want to compute x_k = Mhi*x'_k  and r_k = Mh*r'_k
   repeat
     k = k+1

     z' = A' * p'_k = Mhi*A*Mhi*p'_k  or
       Mh*z' = A*Mhi*p'_k  or
(1)    z = A * p_k   
          where p_k = Mhi*p'_k and z = Mh*z'

     nu'_k = r'_(k-1)^T*r'_(k-1) / p'_k^T*z' or
       nu'_k = (Mhi*r_(k-1))^T*(Mhi*r_(k-1)) / (Mh*p_k)^T*(Mhi*z) 
          where r_k = Mh*r'_k   or
        nu'_k = r_(k-1)^T*Mi*r_(k-1) / p_k^T*z  or
(2)    nu'_k = r_(k-1)^T*y_(k-1) / p_k^T*z  
          where y_(k-1) = Mi*r_(k-1)  

     x'_k  = x'_(k-1) + p'_k * nu'_k or
       Mhi * x'_k  = Mhi * x'_(k-1) + Mhi * p'_k * nu'_k or
(3)    x_k = x_(k-1) + p_k * nu'_k  
          where x_k = Mhi*x'_k

     r'_k  = r'_(k-1) - z' * nu'_k or
       Mh*r'_k  = Mh*r'_(k-1) - Mh*z' * nu'_k or
(4)    r_k = r_(k-1) - z * nu'_k 

     mu'_(k+1) = r'_k^T*r'_k / r'_(k-1)^T*r'_(k-1) or
       mu'_(k+1) = r_k^T*Mi*r_k / r_(k-1)^T*Mi*r_(k-1) or
(6)    mu'_(k+1) = r_k^T*y_k / r_(k-1)^T*y_(k-1) 
(5)       where y_k = Mi*r_k  ... computed as step (5), before step (6), in the final algorithm

     p'_(k+1)  = r'_k + p'_k * mu'_(k+1) or
       Mhi*p'_(k+1) = Mhi*r'_k + Mhi*p'_k * mu'_(k+1) or
       p_(k+1) = Mhi*Mhi*r_k + p_k * mu'_(k+1) or
       p_(k+1) = Mi*r_k + p_k * mu'_(k+1) or
(7)    p_(k+1) = y_k + p_k * mu'_(k+1) 

   until || r_k ||_2 small enough
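
   Collecting steps (1)-(7), here is a NumPy sketch of the resulting
   preconditioned CG (my code; solve_M(r) stands for an application of
   M^(-1) to r, e.g. a diagonal scaling or an incomplete-Cholesky solve):

      import numpy as np

      def pcg(A, b, solve_M, tol=1e-10, maxiter=None):
          """Preconditioned CG for spd A; solve_M(r) applies M^(-1) to r."""
          n = len(b)
          maxiter = maxiter or 10 * n
          x = np.zeros(n)                  # x_0 = 0
          r = b.copy()                     # r_0 = b
          y = solve_M(r)                   # y_0 = M^(-1)*r_0
          p = y.copy()                     # p_1 = y_0
          rTy = r @ y                      # r_0^T*y_0
          for k in range(1, maxiter + 1):
              z = A @ p                    # (1)
              nu = rTy / (p @ z)           # (2)
              x = x + nu * p               # (3)
              r = r - nu * z               # (4)
              if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                  return x, k              # || r_k ||_2 small enough
              y = solve_M(r)               # (5)
              rTy_new = r @ y
              mu = rTy_new / rTy           # (6)
              p = y + mu * p               # (7)
              rTy = rTy_new
          return x, maxiter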

How is M chosen in practice?
   (1) If diag(A) has entries varying widely in magnitude, M = diag(A) helps,
       and is called a Jacobi preconditioner, by analogy to Jacobi iteration
       (see the sketch after this list)
   (2) If A has square diagonal blocks A_ii that are easy to solve, then 
       M = diag(A_11,A_22,...) also works, and is called a block Jacobi
       preconditioner. A_ii can be easy because it is small, or because
       we can exploit its structure...
   (3) Domain decomposition: We have been talking about how to solve the
       model problem (2D Poisson) on a simple region like a rectangle,
       but suppose we need to solve it on a more complicated
       region. For illustration suppose we have two rectangles glued
       together into an L-shaped region, and number the vertices
       in one rectangle before the vertices in the other (see page 349
       of the book). Then the diagonal blocks A_ii of A will be model
       problems on rectangles, and we will be able to use fast solvers
       for these (like FFT and multigrid, discussed later).
   (4) Instead of Jacobi, we can let an application of M^(-1) be
        one application of Symmetric SOR(w) (run SOR(w) twice, the second
        time with the variables in reverse order, to make M symmetric)
   (5) We can do "incomplete Cholesky" A ~ Li*Li^T = M, where Li is sparser
       than the true L, and so cheaper to compute (eg no fill allowed),
       so M^(-1) is solving with Li and Li^T
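
   For example, here is the pcg sketch above with the Jacobi choice (1),
   M = diag(A); the test matrix is my illustration, not from the notes:

      import numpy as np

      rng = np.random.default_rng(2)
      n = 100
      B = rng.standard_normal((n, n))
      A = np.diag(np.geomspace(1.0, 1e6, n)) + B @ B.T   # spd, widely varying diagonal
      b = rng.standard_normal(n)

      d = np.diag(A).copy()
      x_jac, it_jac   = pcg(A, b, solve_M=lambda r: r / d)   # Jacobi preconditioner
      x_cg,  it_plain = pcg(A, b, solve_M=lambda r: r)       # M = I, i.e. plain CG
      print(it_jac, it_plain)   # Jacobi typically needs many fewer iterations here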

Using FFTs to solve the Poisson equation
  This works for Poisson in any number of dimensions; we will illustrate
  it in the 2D case. In all cases the complexity is
  O(dimension*log(dimension)), where "dimension" means the number of
  unknowns, which is very close to the lower bound of O(dimension).
  Recall the famous N-by-N matrix F_jk = 1/sqrt(N) * exp(-2*pi*i*j*k/N) 
     with i = sqrt(-1), and 0 <= j,k <= N-1
  Thm (Cooley/Tukey): It is possible to compute F*x in O(N*log(N)) flops using
    the "Fast Fourier Transform"
  Now recall T_N = V_N*Lambda_N*V_N^T where
    V_N,jk = sqrt(2/(N+1))*sin(j*k*pi/(N+1))
  Comparing V_N and F_N we see that V_N is (up to a scale factor) a submatrix
   of the imaginary part of F_(2*N+2), and so it is possible to multiply any
   vector by V_N in O(N*log(N)) flops as well, using an analogous algorithm
   called the Fast Sine Transform.
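
  As a sanity check (my illustration), scipy.fft.dst with type=1 computes
  y[j] = 2*sum_k x[k]*sin(pi*(j+1)*(k+1)/(N+1)), so V_N times a vector is a
  DST-I up to the scale factor sqrt(2/(N+1))/2:

     import numpy as np
     from scipy.fft import dst

     N = 8
     j = np.arange(1, N + 1)
     V = np.sqrt(2.0 / (N + 1)) * np.sin(np.outer(j, j) * np.pi / (N + 1))

     x = np.random.default_rng(3).standard_normal(N)
     y_slow = V @ x                                          # O(N^2)
     y_fast = np.sqrt(2.0 / (N + 1)) * 0.5 * dst(x, type=1)  # O(N log N)
     print(np.allclose(y_slow, y_fast))                      # True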

  Now write the N^2-by-N^2 2D Poisson equation as follows:
      T*X + X*T = F
  where X is the unknown N-by-N matrix to solve for. Rewrite as
     (V*Lambda*V^T)*X + X*(V*Lambda*V^T) = F
  or
    V^T * [  (V*Lambda*V^T)*X + X*(V*Lambda*V^T) = F ] * V
  or
     Lambda*(V^T*X*V) + (V^T*X*V)*Lambda = V^T*F*V
  or
     Lambda*X_V + X_V*Lambda = F_V 
  This is a diagonal system of equations in disguise, with
     lambda_j * X_V,jk + X_V,jk * lambda_k = F_V,jk
  or
(*)  X_V,jk = F_V,jk / (lambda_j + lambda_k)
  with lambda_j = 2*(1-cos(pi*j/(N+1)))

  So here is the overall algorithm:
     (1) Compute F_V = V^T*F*V using FFTs:
             V^T*F requires multiplying each column of F by V^T=V, an FFT
             (V^T*F)*V requires multiplying each row of V^T*F by V^T=V, an FFT
         So the total cost is 2*N FFTs, or O(N^2 log N)
     (2) Solve for X_V from F_V using (*), at a cost of O(N^2)
     (3) Compute X = V*X_V*V^T using FFTs as in step (1)
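
  Here is a sketch (mine) of steps (1)-(3), using scipy's DST-I as the fast
  sine transform; the scaling constant c below accounts for scipy's
  unnormalized transform:

     import numpy as np
     from scipy.fft import dstn

     def poisson_fft_solve(F):
         """Solve T*X + X*T = F, T = tridiag(-1,2,-1) of size N, via 2D sine transforms."""
         N = F.shape[0]
         lam = 2.0 * (1.0 - np.cos(np.pi * np.arange(1, N + 1) / (N + 1)))  # eigenvalues of T
         c = 1.0 / (2.0 * (N + 1))                # c*dstn(., type=1) = V^T*(.)*V
         FV = c * dstn(F, type=1)                 # step (1): F_V = V^T*F*V,  O(N^2 log N)
         XV = FV / (lam[:, None] + lam[None, :])  # step (2): X_V,jk = F_V,jk/(lam_j+lam_k), O(N^2)
         return c * dstn(XV, type=1)              # step (3): X = V*X_V*V^T,  O(N^2 log N)

     # quick check against the dense equation for small N
     N = 16
     T = 2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
     F = np.random.default_rng(4).standard_normal((N, N))
     X = poisson_fft_solve(F)
     print(np.allclose(T @ X + X @ T, F))         # True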
    
Note: Just like matrix multiplication and other linear algebra operations,
the FFT, Fast Sine Transform, and related transforms have been highly tuned
for many architectures, see www.fftw.org and www.spiral.net.