Notes for Ma221 Lecture 22, Nov 9 2010

Begin Chapter 6 on iterative methods

Goals: 
    Contrast direct and iterative methods
       For Ax=b or least squares: use iterative methods when direct methods
         are too slow, use too much memory (from fill-in), or when we don't
         need as much accuracy
       For eigenproblems or SVD: use iterative methods for above reasons, 
         or when only a few eigenvalues/vectors are desired
    A good iterative method exploits knowledge of the underlying
    physical or mathematical model that gives rise to the matrix
       Large diversity of methods, no one best
       Model problem: Poisson equation in 2D or 3D
          arises in electrostatics, heat flow, quantum mechanics, 
            incompressible fluid mechanics, ...
    Large diversity of software tools available
       see web page
    Summary of iterative methods as applied to Model problem
       From simple (slow) to sophisticated (fast)
         (1) simple ones apply to more problems than Model problem
         (2) simple ones are building blocks for more sophisticated ones
    Methods covered:
       Jacobi
       Gauss-Seidel
       Successive Overrelaxation (SOR)
       Krylov Subspace Methods  (KSMs)
          Conjugate Gradients, GMRES,...
          preconditioning
       FFT
       Multigrid
       Domain decomposition (hybrids of above)
       Time permitting, eigenvalue algorithms (Chap 7)
    Optimizing performance
       Most methods assume the only thing you can afford to do is y=A*x
          SpMV = Sparse-Matrix-Vector multiplication
          KSMs only need "black box" subroutine for y=A*x for user's A
       Parallelizing SpMV
       Communication-avoiding methods
          reorganize methods to take k steps for each access of A, instead of 1 step
          possible class projects 
       
Poisson's Equation
   1D with Dirichlet boundary conditions
        -d^2 v(x)/d x^2 = f(x) on (0,1) with v(0)=v(1)=0
      discretize with centered differences to get T_N
         h = 1/(N+1), x_i = i*h, v_i = v(i*h), f_i = f(i*h),  
         d v(x)/dx at x=(i-.5)*h ~ (v_i - v_i-1)/h
         d v(x)/dx at x=(i+.5)*h ~ (v_i+1 - v_i)/h
         d^2 v(x)/d x^2 at x = i*h ~ ((v_i+1 - v_i)/h - (v_i - v_i-1)/h )/h
                                   = ( v_i+1 - 2*v_i + v_i-1 ) / h^2
            with truncation error tau_i = O(h^2)
         so ODE becomes
              -v_i+1 + 2*v_i - v_i-1 = h^2 (f_i + tau_i)
                                     = h^2*f_i + O(h^4)
              for i=1 to N, with v_0 = v_N+1 = 0
         or
              T_N * [v_1;...;v_N] =  T_N * v = h^2*[f_1;...;f_N] = h^2*f
         where T_N is tridiagonal with diagonals = 2 and offdiagonals = -1.
         This is often referred to as a "stencil" on a 1D mesh. (draw)
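
          As a sanity check, here is a short Matlab sketch (the true solution
          v(x) = sin(pi*x) is our choice, not from the text) that builds T_N,
          solves the discrete equation, and confirms the O(h^2) error:
             % solve -v'' = f on (0,1), v(0)=v(1)=0, true solution v(x)=sin(pi*x)
             N = 100;  h = 1/(N+1);  x = (1:N)'*h;
             T = 2*eye(N) - diag(ones(N-1,1),-1) - diag(ones(N-1,1),1);   % T_N
             f = pi^2*sin(pi*x);              % chosen so that -v'' = f
             v = T \ (h^2*f);                 % discrete solution
             err = norm(v - sin(pi*x), inf)   % should be O(h^2)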

   Eigenvalues and eigenvectors of T_N
      Lemma: T_N * z_j = lambda_j * z_j where
              lambda_j = 2*(1 - cos(pi*j/(N+1)))
             z_j(k) = sqrt(2/(N+1)) * sin(j*k*pi/(N+1)) 
             and ||z_j||_2 = 1
      Proof: Trigonometry (homework!)
      Corollary: The matrix Z where Z_jk = sqrt(2/(N+1))*sin(j*k*pi/(N+1))
        is orthogonal (related to FFT matrix later)

      (Fig 6.1 in text (graph711.eps) for eigenvalues, 
       Fig 6.2 in text (graph712.eps) for eigenvectors)

      Eigenvalues range from smallest lambda_j ~ (pi*j/(N+1))^2 for small j
       to largest lambda_N ~ 4

       This tells us the condition number of T_N,
           lambda_N/lambda_1 ~ (2(N+1)/pi)^2 = (2/pi)^2*h^(-2)
       and lets us bound the error in the solution of the discrete 1D Poisson Eq:
          || T_N^{-1}*h^2*tau || = O((N+1)^2*h^4) = O(h^2)
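
       These formulas are easy to check numerically; a Matlab sketch:
          N = 20;  T = 2*eye(N) - diag(ones(N-1,1),-1) - diag(ones(N-1,1),1);
          j = (1:N)';
          lambda = 2*(1 - cos(pi*j/(N+1)));       % claimed eigenvalues
          norm(sort(eig(T)) - sort(lambda))       % ~ machine epsilon
          Z = sqrt(2/(N+1))*sin(pi*(j*j')/(N+1)); % claimed eigenvector matrix
          norm(Z'*Z - eye(N))                     % Z is orthogonal
          norm(T*Z - Z*diag(lambda))              % T*z_j = lambda_j*z_j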

      Let us compare the eigenvalues and eigenvectors of T_N with the
      eigenvalues and eigenfunctions of the differential equation:
          -d^2 v'(x) / d x^2 = lambda'*v'(x) with v'(0) = v'(1) = 0
       We see that any eigenfunction v'(x) must be a linear combination of
       sin(sqrt(lambda')*x) and cos(sqrt(lambda')*x). v'(0) = 0 means it is a
       multiple just of sin(sqrt(lambda')*x), and v'(1)=0 means sqrt(lambda') = j*pi,
       or lambda'_j = (j*pi)^2 and v'_j(x) is a multiple (say 1) of sin(j*pi*x).

       Thus the eigenvectors z_j of T_N are exactly the (scaled) eigenfunctions
       v'_j of the ODE sampled at the points x_k = k*h, and when j is small,
       the eigenvalues lambda'_j = (j*pi)^2 are well-approximated by h^(-2)*lambda_j.

   2D with Dirichlet boundary conditions
       -d^2 v(x,y) / dx^2 - d^2 v(x,y) / dy^2 = f(x,y) on square (0,1)^2,
         with v(x,y)=0 on boundary of square
      Discretizing as before we get v_ij ~ v(i*h,j*h) etc, and
        d^2 v(x,y) / dx^2 ~ ( v_i-1,j - 2*v_i,j + v_i+1,j ) / h^2
        d^2 v(x,y) / dy^2 ~ ( v_i,j-1 - 2*v_i,j + v_i,j+1 ) / h^2
      and adding these and multiplying by -h^2
        4*v_ij - v_i-1,j - v_i+1,j - v_i,j-1 - v_i,j+1 = h^2*f_ij
      This is often referred to as a "stencil" on a 2D mesh. (draw)
      Letting V be the NxN matrix of v_ij, we can also write
        2*v_ij - v_i-1,j - v_i+1,j  = (T_N*V)_ij
        2*v_ij - v_i,j-1 - v_i,j+1  = (V*T_N)_ij
      or
        4*v_ij - v_i-1,j - v_i+1,j - v_i,j-1 - v_i,j+1 = (T_N*V + V*T_N)_ij
      and
        T_N*V + V*T_N = h^2*F 
      a system of N^2 linear equations for the N^2 entries of V, though
      not in the usual "A*x=b" form.

      We can ask about the corresponding eigenvalues and eigenvectors
      in the analogous form
        T_N*V + V*T_N = lambda*V
      Suppose that T_N*z_i = lambda_i*z_i and T_N*z_j = lambda_j*z_j
      are any two eigenpairs of T_N. Letting V = z_i * z_j^T , we get
        T_N*V + V*T_N = T_N*z_i*z_j^T + z_i*z_j^T*T_N
                      = (T_N*z_i)*z_j^T + z_i*(z_j^T*T_N)
                      = (lambda_i*z_i)*z_j^T + z_i*(z_j^T*lambda_j)
                      = (lambda_i + lambda_j)*(z_i*z_j^T)
                      = ( lambda_i + lambda_j ) * V
      so that V is an "eigenvector". Since there are N^2 unknowns,
      we expect N^2 eigenpairs, for all possible sums of pairs lambda_i + lambda_j.
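
       A quick Matlab check of this calculation (a sketch; the pair of indices
       i = 3, j = 7 is arbitrary):
          N = 10;  T = 2*eye(N) - diag(ones(N-1,1),-1) - diag(ones(N-1,1),1);
          [Z,Lam] = eig(T);  lam = diag(Lam);
          i = 3;  j = 7;                       % any pair of indices
          V = Z(:,i)*Z(:,j)';                  % candidate "eigenvector"
          norm(T*V + V*T - (lam(i)+lam(j))*V)  % ~ machine epsilon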
          
      Comparing V with the eigenvalues and eigenfunctions of the 2D PDE,
       we see that (homework!)
          v_ij(x,y) = sin(i*pi*x) * sin(j*pi*y)
       is an eigenfunction with eigenvalue
         i^2*pi^2 + j^2*pi^2 = (i^2+j^2)*pi^2
      so like the discrete case, we just multiply all pairs of eigenvectors
      component-by-component, and add the corresponding eigenvalues.

      (Fig 6.3 in text (2DPoissonEV1.eps, 2DPoissonEV2.eps for eigenvectors
       and contour plots)

      We want to write V as a single vector in a way that easily extends
      to higher dimensional Poisson equations:
        In the 3x3 case, write V columnwise from left to right as:
          v = [ v_11; v_21; v_31; v_12; v_22; v_32; v_13; v_23; v_33 ]
       and T_NxN as
         [  4 -1  0  -1                 ]
         [ -1  4 -1     -1              ]
         [  0 -1  4        -1           ]
         [ -1         4 -1  0  -1       ]
         [    -1     -1  4 -1     -1    ]
         [       -1   0 -1  4        -1 ]
         [           -1         4 -1  0 ]
         [              -1     -1  4 -1 ]
         [                 -1   0 -1  4 ]
       The pattern is that in each row, the 
           4  on the diagonal multiplies v_ij,
          -1s next to the diagonal multiply v_i-1,j and v_i+1,j
          -1s N away from the diagonal multiply v_i,j-1 and v_i,j+1
        Note that some rows have fewer than four -1s, namely those whose
        mesh points are next to the boundary of the square. One may confirm that
         T_NxN * v = h^2*f
       Another way to write T_NxN is
         T_NxN = [ T_N + 2*I_N   -I_N                  ]
                 [     -I_N    T_N + 2*I_N    -I_N     ]
                 [                 ...                 ]
                 [                 -I_N    T_N + 2*I_N ]
       a pattern we will generalize to arbitrary dimensions,
       using Kronecker products

   Poisson's equation in d dimensions
    We need to convert arrays most naturally indexed by 2 or more indices
    (like V above) into vectors.
 
    Def: Let X be mxn, then vec(X) is the m*n x 1 vector gotten by stacking
         the columns of X atop one another, from left to right.
         (In MatLab, vec(X) = reshape(X,m*n,1))

    Def: Let A be mxn and B be pxq, then A kron B is the m*p x n*q matrix
           [ A_11*B  A_12*B  ... A_1n*B ]
           [ A_21*B  A_22*B  ... A_2n*B ]
           [                 ...        ]
           [ A_m1*B  A_m2*B  ... A_mn*B ]
         (in Matlab, A kron B = kron(A,B))

    Lemma: Let A be mxm,  B be nxn and X and C be mxn, then
       1. vec(A*X) = (I_n kron A) * vec(X)
       2. vec(X*B) = (B^T kron I_m) * vec(X)
       3. The 2D Poisson equation T_N*V + V*T_N = h^2*F may be written
            (I_N kron T_N   +   T_N kron I_N) * vec(V) = h^2*vec(F)
     Proof of 1: I_n kron A = diag(A,A,...,A)  n times  so (I_n kron A) * vec(X)
                 = diag(A,A,...,A)*[X(:,1); X(:,2); ... ; X(:,n)]
                 = [ A*X(:,1); A*X(:,2); ... ; A*X(:,n) ]
                 = vec(A*X)
     Proof of 2: similar (homework!)
      Proof of 3: Apply part 1 to T_N*V and part 2 to V*T_N, noting that T_N is symmetric

    Note that 
      I_N kron T_N  +  T_N kron I_N  =
         [ T_N             ] + [ 2*I_N   -I_N               ]
         [     T_N         ]   [  -I_N  2*I_N   -I_N        ]
         [         ...     ]   [                 ...        ]
         [             T_N ]   [                -I_N  2*I_N ] 
    In Matlab, this is two lines:
      TN = 2*eye(N) - diag(ones(N-1,1),-1) - diag(ones(N-1,1),+1)
      TNN = kron(eye(N),TN) + kron(TN,eye(N))
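
     Continuing this sketch, we can solve the discrete 2D Poisson equation in
     "A*x=b" form and confirm it agrees with the Sylvester form
     T_N*V + V*T_N = h^2*F (here N and the right-hand side F are our choices):
       N = 30;  h = 1/(N+1);  F = rand(N);
       TN  = 2*eye(N) - diag(ones(N-1,1),-1) - diag(ones(N-1,1),+1);
       TNN = kron(eye(N),TN) + kron(TN,eye(N));
       v = TNN \ (h^2*reshape(F,N*N,1));    % reshape(F,N*N,1) is vec(F)
       V = reshape(v,N,N);
       norm(TN*V + V*TN - h^2*F, 'fro')     % ~ machine epsilon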
 
    Lemma (homework!)
       (1) Assume that A*C and B*D are well defined.
          Then (A kron B)*(C kron D) = (A*C) kron (B*D)
      (2) If A and B are invertible then (A kron B)^(-1) = A^(-1) kron B^(-1)
      (3) (A kron B)^T = A^T kron B^T

    Prop: Let T = Z*Lambda*Z^T be the eigendecomposition 
          of the NxN symmetric matrix T.  Then the eigendecomposition of 
            I kron T + T kron I 
               = (Z kron Z) * (I kron Lambda + Lambda kron I) * (Z^T kron Z^T)
           where I kron Lambda + Lambda kron I is a diagonal matrix whose
           ((i-1)*N+j)th diagonal entry is lambda_i + lambda_j, and
           Z kron Z is orthogonal with ((i-1)*N+j)th column z_i kron z_j.

      Proof: Multiply out the factorization using the last lemma to get
            (Z kron Z) * (I kron Lambda + Lambda kron I) * (Z^T kron Z^T) 
               = (Z*I kron Z*Lambda + Z*Lambda kron Z*I) * ( Z^T kron Z^T) 
               = (Z*I*Z^T kron Z*Lambda*Z^T + Z*Lambda*Z^T kron Z*I*Z^T) 
               = (I kron T + T kron I)  as desired

    Similarly, Poisson's equation in 3D leads to the matrix
      T_NxNxN = ( T_N kron I_N kron I_N ) +
                ( I_N kron T_N kron I_N ) +
                ( I_N kron I_N kron T_N ) 
    with eigenvalue matrix 
                ( Lambda_N kron I_N     kron I_N      ) +
                ( I_N      kron Lambda_N kron I_N     ) +
                ( I_N      kron I_N     kron Lambda_N ) 
    i.e. N^3 eigenvalues lambda_i + lambda_j + lambda_k for all triples (i,j,k),
    and corresponding eigenvector matrix Z kron Z kron Z.
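
     In Matlab this is again a few lines (a sketch; sparse storage is
     essential, since T_NxNxN is N^3 x N^3):
       N = 10;  e = ones(N,1);
       TN = spdiags([-e 2*e -e], -1:1, N, N);   % sparse T_N
       I  = speye(N);
       TNNN = kron(TN,kron(I,I)) + kron(I,kron(TN,I)) + kron(I,kron(I,TN));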
     Poisson's equation in d dimensions is similarly represented.
      
Summary of performance of iterative methods as applied to 
Discrete Poisson equation in 2D (and 3D).
"Parallel Steps" is on a machine with as many processors as
you need (#Procs), and so refers only to the available parallelism,
not how you would actually implement it on a real machine
with a smaller number of processors.

Some explanations and abbreviations:
Explicit inverse assumes we have precomputed
  the exact inverse of the matrix (and are not
  counting the cost of doing this!) and only
  need to multiply by it
SOR = Successive overrelaxation
SSOR/Chebyshev = Symmetric SOR with Chebyshev Acceleration
FFT = Fast Fourier Transform
BCR = Block Cyclic Reduction
Lower Bound is a lower bound that assumes one flop per
  component of the output vector

For 2D Poisson on an nxn mesh,   we let N = n^2 = #unknowns
For 3D Poisson on an nxnxn mesh, we let N = n^3 = #unknowns
If the table entry for 3D Poisson differs from 2D, it is shown
   in parentheses.
All table entries are meant in a Big-Oh sense.
#Parallel steps is the fewest possible, given the largest useful #Procs
  (so #Procs is much larger than what you would have in practice,
   and is just a theoretical limit on how much parallelism there
   is in the problem).

Method                Direct or   #Flops    Mem       #Parallel #Procs
                      Iterative                       Steps
Dense Cholesky            D       N^3       N^2       N         N^2
   works on any spd matrix
-------------------------------------------------------------------------
Explicit Inverse          D       N^2       N^2       log N     N^2
-------------------------------------------------------------------------
Band Cholesky             D       N^2       N^(3/2)   N         N
                                  (N^(7/3)) (N^(5/3))           (N^(4/3))
   Works on any spd band matrix with cost O(N*bw^2),
       2D: bw = n = N^(1/2)
       3D: bw = n^2 = N^(2/3)
-------------------------------------------------------------------------
Jacobi                    I       N^2       N         N         N
               3D case: Homework!  (N^(5/3))           (N^(2/3))
   #Flops = O(#Flops(SpMV)* #iterations), where #Flops(SpMV) = O(#nonzeros in A)
     and #iterations ~ cond(A) = n^2 in 2D and 3D
   Works on any diagonally dominant matrix, convergence rate depends on cond(A)
-------------------------------------------------------------------------
Gauss-Seidel              I       N^2       N         N         N
               3D case: Homework!  (N^(5/3))           (N^(2/3))
   cost analogous to Jacobi (constants differ)
   Works on any diagonally dominant or spd matrix
-------------------------------------------------------------------------
Sparse Cholesky           D       N^(3/2)   N*log N   N^(1/2)   N
                                  (N^2)     (N^(4/3)) (N^(2/3)) (N^(4/3))
   Assumes we are using the best ordering known (nested dissection)
   Works on any spd matrix; complexity arises from cost of dense Cholesky
     on trailing nxn submatrix (in 2D) or n^2xn^2 submatrix (in 3D)
-------------------------------------------------------------------------
Conjugate Gradients       I       N^(3/2)   N         N^(1/2)*log N    N
               3D case: Homework!  (N^(4/3))           (N^(1/3)*log N)  N
   #Flops =O(#Flops(SpMV)* #iterations), where cost(SpMV) = O(#nonzeros in A)
     and #iterations = O(sqrt(cond(A)))
   Works on any spd matrix
-------------------------------------------------------------------------
SOR                       I       N^(3/2)   N         N^(1/2)   N
               3D case: Homework!  (N^(4/3))           (N^(1/3)) N
   SOR = Successive Overrelaxation
   Works on any spd matrix, convergence rate depends on cond
-------------------------------------------------------------------------
SSOR/Chebyshev            I       N^(5/4)   N         N^(1/4)   N
              3D case: Homework!  (N^(7/6))           (N^(1/6))
   #Flops = O(#Flops(SpMV) * #iterations), where #flops(SpMV) = O(#nonzeros in A)
     and #iterations = (cond(A))^(1/4)
   Works on any spd matrix, but need to know range of eigenvalues
-----------------------------------------------------------------------
FFT                       D       N*log N   N         log N     N
   Works on model problem
-------------------------------------------------------------------------
BCR                       D       N*log N   N         ?         ?
   BCR = Block Cyclic Reduction
              3D case: Homework!
   Works on problems slightly more general than model problem
-------------------------------------------------------------------------
Multigrid                 I       N         N         log^2 N   N
   Many variants of this algorithm, which work quite generally on
    problems arising from elliptic PDE and FEM models
-----------------------------------------------------------------------
Lower bound                       N         N         log N
-----------------------------------------------------------------------


Basic iterative methods: given an initial guess x_0 for a solution
of A*x=b, cheaply compute a sequence x_i converging to A^(-1)*b.

Def: A splitting of A is a decomposition A = M - K with M nonsingular

This yields the following iterative method: Write A*x = M*x - K*x = b,
so M*x = K*x + b. Then compute x_(i+1) from x_i by solving M*x_{i+1} = K*x_i + b,
or x_{i+1} = M^(-1)*K*x_i + M^(-1)*b = R*x_i + c. 
  (*)  x_{i+1} = R*x_i + c
For this to work well, we need two conditions:
(1) it should converge
(2) multiplying by R (i.e. solving M*x_(i+1) = K*x_i + b for x_(i+1)) and 
    computing c = M^{-1}*b should be much cheaper than solving with A itself.
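
As a concrete sketch, the whole scheme fits in a few lines of Matlab
(the function name, stopping test, and argument list are ours, and we assume
solving with M, e.g. diagonal or triangular, is cheap):
   % splitting iteration: x_{i+1} = M \ (K*x_i + b),  i.e.  x_{i+1} = R*x_i + c
   function x = splitting_solve(A, M, b, x, tol, maxit)
     K = M - A;                              % A = M - K
     for i = 1:maxit
       x = M \ (K*x + b);
       if norm(b - A*x) <= tol*norm(b), break, end
     end
   end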

Lemma: Let ||.|| be any operator norm. Then if ||R||<1, (*) converges
to A^(-1)*b for any x_0.
Proof: Subtract x = R*x + c from x_{i+1} = R*x_i+c to get
x_(i+1) - x = R*(x_i-x) or 
  ||x_(i+1)-x|| <= ||R||*||x_i-x|| <= ... <= ||R||^(i+1)*||x_0-x||
which converges to zero for any x_0 if ||R||<1.

Def: The spectral radius of R is rho(R) = max |lambda|, the largest absolute
value of any eigenvalue of R

Lemma: For all operator norms, rho(R) <= ||R||. For all R and all eps>0,
there exists an operator norm ||.||_* such that ||R||_* <= rho(R) + eps

Proof: To show rho(R) <= ||R||, note that ||R|| = max_x ||R*x||/||x|| 
By picking x to be an eigenvector, we see that ||R|| >= |lambda| for any
eigenvalue lambda of R. To construct ||.||_*, we use the Jordan Form of R:
Let S^{-1}*R*S = J be in Jordan form, i.e. J is block diagonal with Jordan
blocks. Let D = diag(1,eps,eps^2,...,eps^(n-1)). Then
J_eps = D^{-1}*J*D leaves the diagonal (eigenvalues) unchanged, and multiplies
each superdiagonal entry of J by eps, so J_eps has eigenvalues on the diagonal,
and eps above the diagonal in any Jordan blocks. We now define a new vector
norm by ||x||_* = ||(S*D)^{-1}*x||_inf. (See Question 1.5 for the proof that
this defines a vector norm.) Then
  ||R||_* = max_x ||R*x||_* / ||x||_*
          = max_x ||(S*D)^(-1)*R*x||_inf / ||(S*D)^(-1)*x||_inf
          = max_y ||(S*D)^(-1)*R*(S*D)*y||_inf / ||y||_inf
                       where y = (S*D)^(-1)*x
          = max_y ||D^(-1)*S^(-1)*R*S*D*y||_inf / ||y||_inf
          = max_y ||J_eps*y||_inf / ||y||_inf
          <= max |lambda| + eps
          = rho(R) + eps

Theorem: The iteration x_{i+1} = R*x_i + c converges to the solution of A*x=b
for all starting vectors x_0 if and only if rho(R)<1.

Proof: If rho(R) >= 1, choose x_0 so that x_0-x is an eigenvector of R for
an eigenvalue of largest absolute value, call it lambda. Then x_i-x = R^i*(x_0-x) = lambda^i*(x_0-x)
does not converge to zero since |lambda| >= 1.
If rho(R) < 1, use the last lemma to conclude that there is an operator norm ||R||_*
and an eps such that ||R||_* <= rho(R)+eps < 1, so that
x_i-x = R^i*(x_0-x) converges to zero for all x_0.

Obviously, the smaller rho(R) is, the faster the convergence. Our goal is to pick the
splitting A = M-K so that multiplying by R = inv(M)*K is easy, and 
rho(R) is small. Choosing M = I and K = I-A makes multiplying by R easy, but
may not generally make rho(R) small. Choosing K = 0 and so R = 0 makes convergence 
immediate, but requires having the solution c = A^{-1}*b.

Now we can describe Jacobi, Gauss-Seidel (GS) and Successive Overrelaxation (SOR).
All share the notation A = D - L' - U' = D*(I - L - U), where D is the diagonal of A,
L' is the negative of the strictly lower triangular part, and U' is the negative of
the strictly upper triangular part of A.

Jacobi: 
   In words, for j = 1 to n, pick x_(i+1)(j) to exactly satisfy equation j
   As a loop:
       for j = 1 to n, x_(i+1)(j) = (b_j - sum_{k neq j} A(j,k)*x_i(k))/A(j,j)
   As a splitting: A = D - (L'+U') = M - K, so 
       R_J = M^{-1}*K = D^{-1}*(L'+U') = L+U
   For the 2D Poisson equation T_N*V+V*T_N = h^2*F, to get from V_i to V_(i+1):
       for j = 1 to n, for k = 1 to n
               V_(i+1)(j,k) = (V_i(j-1,k) + V_i(j+1,k) + V_i(j,k-1) + V_i(j,k+1) 
                              + h^2*F(j,k))/4
                            = "average" of 4 nearest neighbors and right-hand-side

Gauss-Seidel:
    In words, improve on Jacobi by using most recently updated values of solution
    As a loop:
       for j = 1 to n, x_(i+1)(j) = (b_j - sum_{k < j} A(j,k)*x_(i+1)(k)
                                                   ... updated x components
                                         - sum_{k > j} A(j,k)*x_i(k) ) / A(j,j)
                                                   ... older x components
    As a splitting: A = (D-L') - U', so 
      R_GS = (D-L')^{-1}*U' = (D*(I-D^{-1}*L'))^{-1}*U' = (I-L)^{-1}*U
    Note that the order in which we update entries matters in GS (unlike Jacobi).
    For 2D Poisson there are two orders to consider:
      natural order (rowwise or columnwise update of V(i,j))
      Red-Black (RB) ordering: color the entries in the 2D mesh like a checkerboard,
        and number the Reds before all the Blacks. Note that each Red node only has
        Black neighbors, and vice-versa. Thus when updating Red nodes, we can
        update them in any order, since all the Black neighbors have old data.
        Then when we update the Black nodes, we can update them in any order,
        because all the Red neighbors have new data.
     For all (j,k) nodes that are Red (j+k even)
               V_(i+1)(j,k) = (V_i(j-1,k) + V_i(j+1,k) + V_i(j,k-1) + V_i(j,k+1) 
                              + h^2*F(j,k))/4 ... only use old data
     For all (j,k) nodes that are Black (j+k odd)
               V_(i+1)(j,k) = (V_(i+1)(j-1,k) + V_(i+1)(j+1,k) + V_(i+1)(j,k-1) 
                               + V_(i+1)(j,k+1) 
                               + h^2*F(j,k))/4 ... only use new data
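
      A Matlab sketch of one Red-Black Gauss-Seidel sweep, with the same
      bordered storage convention as in the Jacobi sketch above (the logical
      masks pick out j+k even (Red) and j+k odd (Black)):
         function V = rbgs_sweep(V, F, h)
           n = size(V,1) - 2;
           [J,K] = ndgrid(1:n,1:n);
           red = mod(J+K,2) == 0;  black = ~red;
           for pass = 1:2
             mask = red;  if pass == 2, mask = black; end
             U = ( V(1:n,2:n+1) + V(3:n+2,2:n+1) ...
                 + V(2:n+1,1:n) + V(2:n+1,3:n+2) + h^2*F ) / 4;
             Vint = V(2:n+1,2:n+1);  Vint(mask) = U(mask);
             V(2:n+1,2:n+1) = Vint;   % Black pass then sees the new Red values
           end
         end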

SOR: In words: This depends on a parameter w: We let the result of SOR be
       a weighted combination of the old (x_i) and new (x_(i+1)) solutions
       that GS would have computed:
           x^SOR(w)_(i+1)(j) = (1-w)*x_i(j) + w*x^GS_(i+1)(j)
       When w=1, SOR(1) is just GS. When w<1 we would call this underrelaxation,
       and when w>1, we call it overrelaxation. We prefer to "overrelax" with the
       intuition that if moving in the direction from x_i to x_(i+1) was a good idea,
       moving even farther in the same direction is better.
       Later we will show how to pick w optimally for the model problem.
     As a loop:
       for j = 1 to n, x_(i+1)(j) = (1-w)*x_i(j) + 
                                    w*(b_j - sum_{k < j} A(j,k)*x_(i+1)(k)
                                                   ... updated x components
                                           - sum_{k > j} A(j,k)*x_i(k) ) / A(j,j)
                                                   ... older x components
     As a splitting multiply through by D to get
         (D-w*L')*x_(i+1) = ((1-w)*D + w*U')*x_i + w*b   
     and then divide by w to get the splitting 
         A = (D/w - L') - (D/w-D + U') 
     or
          R_SOR(w) = (D/w - L')^(-1) * (D/w - D + U') = (I-w*L)^(-1)*((1-w)*I + w*U)
     For 2D Poisson with Red-Black Ordering:
        For all (j,k) nodes that are Red
               V_(i+1)(j,k) =  (1-w)*V_i(j,k) + w*
                              (V_i(j-1,k) + V_i(j+1,k) + V_i(j,k-1) + V_i(j,k+1) 
                              + h^2*F(j,k))/4 ... only use old data
        For all (j,k) nodes that are Black
               V_(i+1)(j,k) = (1-w)*V_i(j,k) + w*
                              (V_(i+1)(j-1,k) + V_(i+1)(j+1,k) + V_(i+1)(j,k-1) 
                              + V_(i+1)(j,k+1) 
                              + h^2*F(j,k))/4 ... use new data
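
         The Matlab sketch differs from the Red-Black Gauss-Seidel sweep above
         only in the weighted update (w is the relaxation parameter):
            function V = rbsor_sweep(V, F, h, w)
              n = size(V,1) - 2;
              [J,K] = ndgrid(1:n,1:n);
              red = mod(J+K,2) == 0;  black = ~red;
              for pass = 1:2
                mask = red;  if pass == 2, mask = black; end
                U = ( V(1:n,2:n+1) + V(3:n+2,2:n+1) ...
                    + V(2:n+1,1:n) + V(2:n+1,3:n+2) + h^2*F ) / 4;
                Vint = V(2:n+1,2:n+1);
                Vint(mask) = (1-w)*Vint(mask) + w*U(mask);
                V(2:n+1,2:n+1) = Vint;
              end
            end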

Now we analyze the convergence of these basic methods, in general and for the
Model Problem (2D Poisson) in particular.

Jacobi's method for the Model problem is easy to analyze, since in the splitting
T_nxn = M - K = 4*I - (4*I - T_nxn), so R = M^(-1)*K = I - T_nxn/4,
so the eigenvalues of R are 1 - (lambda_i+lambda_j)/4 for all pairs of 
eigenvalues lambda_i = 2*(1-cos(i*pi/(n+1))) and lambda_j of T_n,
and so the spectral radius rho(R) of R is
   1 - lambda_min/2 = 1 - (1 - cos(pi/(n+1))) = cos(pi/(n+1)) ~ 1 - pi^2/(2*(n+1)^2)
which gets closer to 1 as n grows, i.e. Jacobi converges more slowly.

Since the error after m steps is multiplied by rho(R)^m, we estimate 
the speed of convergence by computing how many steps m are needed
to reduce the error by a constant factor. Setting rho(R) = 1 - x
and solving (1-x)^m = exp(-1) for simplicity, we get
   (1 - x)^m = (1-x)^[(1/x)*m*x] ~ exp(-m*x) = exp(-1)
or m = 1/x = 2*(n+1)^2/pi^2 = O(n^2) = O(N) steps, where N=n^2 is the 
number of unknowns. Note that cond(T_nxn)~4*N/pi^2 is also O(N).
So the number of iterations grows proportionally to the dimension N,
and to the condition number. So to reduce the error by any constant
factor costs O(N) iterations * O(N) flops_per_iteration, or O(N^2)
overall, explaining the entry in the table above. This is typical
for many iterative methods: the number of iterations grows with the 
condition number (multigrid is an important exception!).

Later we will show that, provided the variables are updated in the
appropriate order, rho(R_GS) = rho(R_J)^2, i.e. one step of Gauss-Seidel 
reduces the error as much as two Jacobi steps; this is better but
the overall complexity O(N^2) is the same.

Then, again with the right update order, and the right choice of
overrelaxation parameter w, SOR(w) is much faster with
     rho(R_SOR(w_opt)) ~ 1 - 2*pi/n
for large n, which is close to 1 but much farther away than rho(R_J),
and takes only O(n) = O(sqrt(N)) steps to reduce the error by a constant
factor, for an overall cost of O(N^(3/2)) as opposed to O(N^2).

Now we present (and prove some of!) the general theory of these
three methods applied to Ax=b.

Thm 1: If A is strictly row diagonally dominant:
          |A_ii| > sum_{j neq i} |A_ij|
       Then both Jacobi and GS converge, and GS is at least as fast
       as Jacobi in the sense that ||R_GS||_inf <= ||R_J||_inf < 1

The model problem is not strictly row diagonally dominant,
since the diagonal equals the negative sum of the off diagonals
in most rows. We call a matrix where
         |A_ii| >= sum_{j neq i} |A_ij|
in all rows, with strict inequality at least once, weakly row diagonally
dominant. This alone is not enough for convergence: consider 
A = [ 1 -1 0;  1 1 0; 0 0 1] and 
R = [ 0  1 0; -1 0 0; 0 0 0], whose powers do not converge.
So we need one more condition: 

Def: A matrix is A is irreducible if there is no permutation P
such that P*A*P^T is block triangular. Equivalently, the
graph corresponding to entries of A is "strongly connected",
i.e. there is a path from every vertex to every other vertex.

The model problem satisfies this definition, since a mesh is
strongly connected; and it is weakly row diagonally dominant, with
strict inequality |A_ii| > sum_{j neq i} |A_ij| in the rows whose
vertices are next to a boundary of the mesh.

Thm 2: If A is weakly diagonally dominant and irreducible (like
the model problem) then rho(R_GS) < rho(R_J) < 1, so both
Jacobi and GS converge, and GS is faster.

Now we come to SOR(w)

Thm 3: If A is spd, then SOR(w) converges if and only if 0 < w < 2.
In particular SOR(1) = GS converges.

To analyze the convergence of SOR(w) on the model problem we need
to use another graph theoretic property of the matrix:

Def: A matrix has Property A if there is a permutation P such that
P*A*P^T = [ A11 A12 ; A21 A22 ] has the property that A11 and A22
are diagonal. In graph theoretic terms, the vertices V of the graph
can be broken into two disjoint sets V = V1 U V2, where there are
no edges between pairs of vertices in V1, or between pairs of
vertices in V2. Such a graph is called bipartite.

Ex: For a model problem on a 1D mesh, let the odd numbered nodes
be V1 and the even numbered nodes be V2; since the only edges
connect even and odd nodes, the matrix has Property A.

Ex: For a model problem on a 2D mesh, think of the vertices
forming a checkerboard so they are either Red or Black.
Then the only edges connect Red to Black, so the matrix has
Property A. Another way to say this is to put vertices
v_ij with i+j odd in V1 and i+j even in V2. This works
for the 3D model problem and in higher dimensions.

Thm 4: Suppose matrix A has Property A, and we do SOR(w)
updating all the vertices in V1 before all the vertices in V2.
Then the eigenvalues mu of R_J and lambda of R_SOR(w) are
related by    
(*)    (lambda+w-1)^2 = lambda*w^2*mu^2
In particular, if w=1, then lambda=mu^2, so 
rho(R_SOR(1)) = rho(R_GS) = rho(R_J)^2, and GS converges 
twice as fast as Jacobi.

More generally, since we know all eigenvalues mu of R_J for
the model problem, we can use (*) to figure out all
eigenvalues lambda of R_SOR(w), and then pick w_opt to 
minimize rho(R_SOR(w_opt)). This yields

Thm 5: Suppose A has property A, we do SOR(w) updating
vertices in V1 before V2 as before, all eigenvalues
of R_J are real, and mu = rho(R_J)<1 (as in the model problem).
Then
    w_opt = 2/(1+sqrt(1-mu^2))
     rho(R_SOR(w_opt)) = w_opt - 1 = mu^2 / (1+sqrt(1-mu^2))^2
In particular, for the 2D model problem on an n x n grid
    w_opt = 2/(1+sin(pi/(n+1))) ~ 2
     rho(R_SOR(w_opt)) = cos^2(pi/(n+1))/(1+sin(pi/(n+1)))^2 ~ 1 - 2*pi/(n+1)
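
These formulas are easy to check numerically for a small model problem;
a Matlab sketch (we form the dense iteration matrices explicitly, which is
only feasible for small n, and permute A to Red-Black order as in Thm 4):
   n = 10;
   TN = 2*eye(n) - diag(ones(n-1,1),-1) - diag(ones(n-1,1),1);
   A  = kron(eye(n),TN) + kron(TN,eye(n));            % 2D model problem
   [J,K] = ndgrid(1:n,1:n);                           % grid indices of unknowns
   rb = [find(mod(J(:)+K(:),2)==0); find(mod(J(:)+K(:),2)==1)];
   A  = A(rb,rb);                                     % Red-Black ordering (V1 before V2)
   D = diag(diag(A));  Lp = -tril(A,-1);  Up = -triu(A,1);
   rhoJ  = max(abs(eig( D \ (Lp+Up) )));              % should be cos(pi/(n+1))
   rhoGS = max(abs(eig( (D-Lp) \ Up )));              % should be rhoJ^2
   w = 2/(1+sin(pi/(n+1)));                           % w_opt
   rhoSOR = max(abs(eig( (D-w*Lp) \ ((1-w)*D + w*Up) )));  % should be ~ w_opt - 1
   [rhoJ cos(pi/(n+1)); rhoGS rhoJ^2; rhoSOR w-1]     % compare with the formulas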

Now we begin to prove (some of) these results.

Thm 1: If A is strictly row diagonally dominant:
          |A_ii| > sum_{j neq i} |A_ij|
       Then both Jacobi and GS converge, and GS is at least as fast
       as Jacobi in the sense that ||R_GS||_inf <= ||R_J||_inf < 1
Proof: Split A = D - (D - A) so R_J = D^(-1)*(D-A) = I -D^(-1)*A
       So sum_i |R_J(j,i)| 
             = | 1 - A(j,j)/A(j,j)| +  sum_{i neq j} |A(j,i)|/|A(j,j)|
             = ( sum_{i neq j} |A(j,i)| ) / |A(j,j)|
            < 1 by strict row diagonal dominance
       so ||R_J||_inf < 1 and Jacobi converges

       Showing ||R_GS||_inf <= ||R_J||_inf is more involved:
       Write A = D - L' - U' = diagonal + lower triangle + upper triangle
       as before, so 
            R_J = D^(-1)*(L' + U') = L  +  U
        and R_GS = (D-L')^(-1)*U' = (D*(I-D^(-1)*L'))^(-1)*U'
                = (I - D^{-1}*L')^(-1)*(D^(-1)*U')
                = (I - L)^(-1)*U
       We want to show 
           ||R_GS||_inf <= ||R_J||_inf
       or equivalently 
           || |R_GS|*e ||_inf <= || |R_J|*e ||_inf
       where e = [1;1;...;1]. This will be implied by the stronger inequality
           |R_GS|*e <=  |R_J|*e 
       or
       (*)  |(I-L)^(-1)*U|*e <= |L+U|*e = (|L|+|U|)*e
       Now
            (I-L)^(-1) = sum_{i=0 to n-1} L^i   ... note that higher powers of L are zero
        This is a geometric sum, with the same proof as in the scalar case, so
          |(I-L)^(-1)*U| = | sum_i L^i*U |
                         <= sum_i |L|^i*|U|  ... triangle inequality
                         = (I - |L|)^(-1)*|U|
       and (*) would be implied by the stronger inequality
       (**)    (I - |L|)^(-1)*|U|*e <= (|L|+|U|)*e
        This in turn would be implied by multiplying the following inequality
       by the nonnegative matrix (I - |L|)^{-1}:
       (***) |U|*e <= (I - |L|)*(|L|+|U|)*e
       Multiply out (***) and cancel |U|*e to get
             0 <= (|L| - |L|^2 - |L|*|U|)*e
                = |L|*(I - |L| - |U|)*e
                = |L|*(I - |R_J|)*e
       This is (finally!) in turn implied by multiplying |L| times
             0 <= (I - |R_J|)*e
       or
             |R_J|*e <= I*e = e
       or    ||R_J||_inf <= 1
       which was our starting assumption!

We will not prove Thm 2 (for a proof see the book by Varga referenced in the text).

Thm 3: If A is spd, then SOR(w) converges if and only if 0 < w < 2.
In particular SOR(1) = GS converges.

Proof: Abbreviate R_SOR(w) = R. Recall that we write
   A = D - L' - U' = D*(I - L - U) 
and
  R = (D-w*L')^(-1) * ((1-w)*D + w*U')
    = (I - w*L)^(-1) * ((1-w)*I + w*U)
   
First we show that rho(R) >= |w-1|, so that 0 < w < 2 is
required for convergence. We compute the characteristic polynomial of R:
   phi(lambda) = det(lambda*I - R) 
               = det(I-w*L)*det(lambda*I - R)
               = det(lambda*(I-w*L) - (1-w)*I - w*U)
               = det((lambda + w - 1)*I - lambda*w*L - w*U)
and phi(0) = +- prod_i lambda_i(R) = det((w-1)*I - w*U) = (w-1)^n
so some |lambda_i(R)| must be at least |w-1|.

To show SOR(w) converges if 0 < w < 2, let M = (1/w)*D - L', 
Q = A^(-1)*(2*M-A) = 2*A^(-1)*M - I
(1) show Real(lambda_i(Q)) > 0 for all i
(2) show R = (Q-I)*(Q+I)^(-1)
(3) conclude that |lambda_i(R)| < 1 for all i

(1) If Q*x = lambda*x, then (2*M-A)*x = lambda*A*x and 
    x^* * (2*M-A)*x = lambda* x^* * A*x
Add this last equation to its conjugate transpose and 
divide by 2 to get the real part of both sides:
    x^* * (M + M^* - A)*x = Real(lambda)* x^* * A*x
Now 
   M + M^* - A = (2/w)*D - L' - L'^* - (D - L' - L'^*) = (2/w -1)*D
so we can solve for
    Real(lambda) = x^* * (2/w - 1)*D*x / x^* * A * x
                 = positive / positive 
                 = positive

(2) We compute
        (Q-I)*(Q+I)^(-1) = (2*A^(-1)*M - 2*I)*(2*A^(-1)*M)^(-1)
                        = I - M^(-1)*A 
                        = I - ((1/w)*D - L')^(-1)*(D - L' - U')
                        = R
(3) Finally, by the spectral mapping theorem (Question 4.5)
      |lambda(R)| =  | (lambda(Q) - 1) / (lambda(Q) + 1) |
                  =  sqrt( [  (Real(lambda(Q))-1)^2 + Imag(lambda(Q))^2 ] / 
                           [  (Real(lambda(Q))+1)^2 + Imag(lambda(Q))^2 ] )
                  < 1, since Real(lambda(Q)) > 0

Now we consider Krylov Subspace Methods, of which there are many,
depending on the matrix structure. For spd matrices like our model
problem, Conjugate Gradients (CG) is the most widely used, and the
one we will discuss in most detail. We will also summarize
the other methods with a "decision tree" that chooses the right
algorithm depending on properties of your matrix (like symmetry,
positive definiteness, etc).

Unlike Jacobi, GS and SOR, a Krylov subspace method for Ax=b or
A*x = lambda*x need only assume that you have a "black-box" subroutine for
multiplying A times a vector by x (or A^T times x, for some methods). 
So you cannot ask for the diagonal, or the strictly lower or upper triangular part, etc.,
that you would need for methods like Jacobi.
This is particularly useful in two situations: 
(1) If you are writing a library for solving A*x=b, the user can simply 
input a pointer to a function that computes A*z for any z, and your
algorithm calls it without having to know any details about how A
is represented.
(2) A may not actually be represented as a matrix with explicit entries, 
but as the result of evaluating some complicated function.
For example, suppose you have a system, like a car engine or airplane
wing, governed by a complicated set of PDEs. You'd like to optimize
some set y of n outputs of the PDE as a function of some set x of 
n inputs, y=f(x), say the pressure on the wing as a function of
its shape. For many optimization methods (eg Newton's method), this means 
that you need to solve linear systems with the nxn coefficient matrix
A = del f, the Jacobian of f. A is not written down, it is defined implicitly 
as the derivative of the outputs of the PDE with respect to its inputs,
and varies depending on the inputs.  The easiest way to access A is to note 
that A*z = del f * z ~ ( f(x+h*z) - f(x) )/h when h is small enough, 
so multiplication by A requires solving the PDE twice to get f(x+h*z) and f(x)
(care must be taken in choosing h small enough but not too small,
because otherwise the result is dominated by rounding error).
There is no such simple way to get A^T * z, unfortunately.
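
A Matlab sketch of this matrix-free matrix-vector product (f, x, and z are
whatever the application supplies; the choice of h is only a rule of thumb):
   % approximate A*z = (del f(x))*z by a one-sided difference of f
   function Az = jacvec(f, x, z)
     h  = sqrt(eps) * max(norm(x),1) / max(norm(z),1);   % rule of thumb
     Az = ( f(x + h*z) - f(x) ) / h;
   end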

There is actually software that takes an arbitrary collection of C or
Fortran subroutines computing an arbitrary function y = f(x), and 
produces new C or Fortran subroutines that compute del f(x), by differentiating
each line of code automatically; google "ADIFOR" or see
   wiki.mcs.anl.gov/autodiff/

On the other hand, if we assume that the matrix A is available explicitly,
then it is possible to reorganize many Krylov subspace methods to significantly
reduce their communication costs, which are dominated by moving the data needed
to represent the sparse matrix A. For example, when A is too large to fit in
cache (a typical situation), each step of a conventional sequential method does a 
multiplication A*x, which requires reading the matrix A from slow memory to cache.
Thus the communication cost is proportional to the number of steps, call it k.
However the reorganized Krylov subspace methods can take k steps and only
read A from slow memory once, a potential speedup of a factor of k.
In practice k is limited by both the sparsity structure of A and issues
of numerical stability, that we will discuss briefly later.

We begin by discussing how to extract information about A from a subroutine
that does A*z. Given a starting vector y_1 (say y_1 = b for solving A*x=b), 
we can compute y_2 = A*y_1, ... , y_(i+1) = A*y_i, ... y_n = A*y_(n-1) = A^(n-1)*y_1.
Letting K = [y_1,y_2,...,y_n] we get
    A*K = [A*y_1,...,A*y_n] = [y_2,y_3,...,y_n,A^n*y_1]
Assuming for a moment that K is nonsingular, write c = -K^(-1)*A^n*y_1 to get
    A*K = K*[e_2,e_3,...,e_n,-c] = K*C
where
    C = K^(-1)*A*K = [ 0 0 ... 0 -c_1 ]
                     [ 1 0 ... 0 -c_2 ]
                     [ 0 1 ... 0 -c_3 ]
                     [     ...        ]
                     [ 0 0 ... 1 -c_n ]
is upper Hessenberg, and a companion matrix, i.e. its characteristic polynomial
is simply p(x) = x^n + sum_(i=1 to n) c_i*x^(i-1). So just by matrix-vector
multiplication we have reduced A to a simple form, and could imagine using
C to solve A*x=b or find eigenvalues. 

But this would be a bad idea for two reasons:
(1) K is likely to be dense (if y_1 is dense), and solving with K is
probably harder than solving with A
(2) K is likely to be very ill-conditioned, since it is basically
running the power method, so the vectors y_i are converging to the
eigenvectors of the largest eigenvalue in absolute value.

Krylov subspace methods address these two drawbacks, implicitly or explicitly,
as follows: 
We will not compute K but rather an orthogonal Q such that the
leading k columns of K and of Q span the same space, called a Krylov subspace:
   span{y_1,y_2,...,y_k} = span{y_1,A*y_1,...,A^(k-1)*y_1}
                         = {script K}_k(A,y_1)
The relationship between K and Q is simply the QR decomposition: K = Q*R.
Furthermore, we won't compute all n columns of K and Q, but just the
first k << n columns, and settle for the "best" approximate solution
x of A*x=b or A*x = lambda*x that we can find in the Krylov subspace 
they span (the definition of "best" depends on the algorithm).

To proceed, substitute K = Q*R into C = K^{-1}*A*K to get
    C = R^(-1)*Q^T*A*Q*R
or
(*)    R*C*inv(R) = Q^T*A*Q = H
Since R, and hence R^(-1), are upper triangular and C is upper Hessenberg,
H is also upper Hessenberg. If A is symmetric, then so is H,
and so H is tridiagonal.

To see how to compute the columns of Q one at a time, write
Q = [q_1,q_2,...,q_n], rewrite (*) as
  A*Q = Q*H
and equate the j-th column on both sides to get
  A*q_j = sum_{i=1 to j+1} q_i*H(i,j)
Since the q_j are orthonormal, multiply both sides of the last equality
by q_m^T to get
 q_m^T*A*q_j = sum_{i=1 to j+1} H(i,j)*q_m^T*q_i = H(m,j) for 1 <= m <= j
and so
 H(j+1,j) * q_(j+1) = A*q_j - sum_{i=1 to j} q_i*H(i,j)

This yields the following algorithm:

Arnoldi algorithm for (partial) reduction of A to Hessenberg form:
   q_1 = b / ||b||_2
   for j = 1 to k
      z = A*q_j
      for i = 1 to j ... run Modified Gram-Schmidt (MGS) on z
         H(i,j) = q_i^T*z
         z = z - H(i,j)*q_i
      end for
      H(j+1,j) = ||z||_2
      q_(j+1) = z / H(j+1,j)
   end for
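
A runnable Matlab version of this sketch (A may be a matrix or a function
handle computing z -> A*z; Q is n x (k+1) and H is (k+1) x k):
   function [Q,H] = arnoldi(A, b, k)
     if isnumeric(A), Afun = @(z) A*z; else Afun = A; end
     n = length(b);
     Q = zeros(n,k+1);  H = zeros(k+1,k);
     Q(:,1) = b / norm(b);
     for j = 1:k
       z = Afun(Q(:,j));
       for i = 1:j                        % Modified Gram-Schmidt
         H(i,j) = Q(:,i)'*z;
         z = z - H(i,j)*Q(:,i);
       end
       H(j+1,j) = norm(z);                % breakdown (= 0) not handled here
       Q(:,j+1) = z / H(j+1,j);
     end
   end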

The q_j vectors are called Arnoldi vectors, and the cost is k multiplications 
by A, plus O(k^2*n) flops for MGS. If we stop the algorithm here, at k < n, what have 
we learned about A? Write Q = [Q_k,Q_u] where Q_k = [q_1,...,q_k] is
computed as well as the first column q_{k+1} of Q_u; the other columns of Q_u
are unknown. Then
    H = Q^T*A*Q = [Q_k,Q_u]^T * A * [Q_k,Q_u]
                = [ Q_k^T*A*Q_k  Q_k^T*A*Q_u ] = [ H_k  H_uk ]
                  [ Q_u^T*A*Q_k  Q_u^T*A*Q_u ]   [ H_ku H_u  ]
Since H is upper Hessenberg, so are H_k and H_u, and H_ku
has a single nonzero entry in its upper right corner, namely H(k+1,k).
So H_uk and H_u are unknown, and H_k and H_ku are known.

When A is symmetric, so that H is symmetric and tridiagonal, we may write
    H = T = [ alpha_1 beta_1  0                   ...    ]
            [ beta_1  alpha_2 beta_2  0           ...    ]
            [   0     beta_2  alpha_3 beta_3 0    ...    ] 
            [                  ...                       ]
            [                  ... 0  beta_(n-1) alpha_n ]
Equating column j on both sides of A*Q = Q*T yields
    A*q_j = beta_(j-1)*q_(j-1) + alpha_j*q_j + beta_j*q_(j+1)
Multiplying both sides by q_j^T yields 
    q_j^T*A*q_j = alpha_j
leading to the following simpler version of Arnoldi algorithm,
called Lanczos:

Lanczos algorithm for (partial) reduction of A=A^T to tridiagonal form
   q_1 = b / ||b||_2, beta_0 = 0, q_0 = 0
   for j = 1 to k
      z = A*q_j
      alpha_j = q_j^T*z
      z = z - alpha_j*q_j - beta_(j-1)*q_(j-1)
      beta_j = ||z||_2
      q_(j+1) = z / beta_j
   end for

Here is some notation to describe the Arnoldi and Lanczos algorithms.
As above, the space spanned by {q_1,...,q_k} is called a Krylov subspace:
   {script K}_k(b,A) = span{q_1,...,q_k}
or {script K}_k for short.
We call H_k  (T_k for Lanczos) the projection of A onto {script K}_k.

Our goal is to solve A*x=b or A*x = lambda*x by somehow finding
the "best" solution in the Krylov subspace.

For the eigenvalue problem, the answer is simple: use the eigenvalues
of H_k (or T_k) as approximate eigenvalues of A. To see what the
error is, suppose H_k*y = lambda*y and write
   H * [ y ] = [ H_k  H_uk] * [ y ] = [ H_k*y ] = [ lambda*y          ]
       [ 0 ]   [ H_ku H_u ]   [ 0 ]   [ H_ku*y]   [ H(k+1,k)*y_k*e_1 ]
and since A*Q = Q*H
   A * (Q * [ y ] ) = lambda * (Q * [ y ] ) + q_(k+1)*H(k+1,k)*y_k
            [ 0 ]                   [ 0 ]
so if the product of H(k+1,k)*y_k is small, the eigenvalue/vector pair
(lambda, y') = (lambda, Q*[y ; 0]) has a small residual 
   || A*y' - lambda*y' ||_2 = |H(k+1,k)*y_k|
When A is symmetric, we know from Chapter 5 that this is in fact a
bound on the error in lambda.

To see how the eigenvalues of T_k approximate the eigenvalues of A as k grows,
we consider the symmetric case (Lanczos), and plot the eigenvalues of T_k
versus k (Figure 7.2:  LanczosConvergence9.eps and LanczosConvergence29.eps).
Chapter 7 discusses finding eigenvalues and eigenvectors in more detail.

Now we return to Chapter 6, on solving A*x=b.
Given Q_k = [q_1,...,q_k] our goal is to find the "best" approximate
solution x_k in span(Q_k). To make progress, we need to define "best";
depending on what we choose, we end up with different algorithms:

(1) Choose x_k to minimize || x_k - x ||_2, where x is the true solution.
Unfortunately, we don't have enough information in Q_k and H_k (or T_k)
to compute this.

(2) Choose x_k to minimize the norm || r_k ||_2 of the residual r_k = b-A*x_k.
We can do this, and the algorithm is called MINRES (for "minimum residual")
when A is symmetric, and GMRES (for "generalized minimum residual") when it is not.

(3) Choose x_k so that r_k is perpendicular to {script K}_k, i.e. r_k^T * Q_k = 0.
This is called the orthogonal residual property, or a Galerkin condition, by
analogy to finite elements. When A is symmetric, the algorithm is called
SYMMLQ. When A is nonsymmetric, a variant of GMRES works.

(4) When A is spd, it defines a norm || r ||_{A^(-1)} = (r^T * A^(-1) * r)^(1/2).
We say the best solution minimizes || r_k ||_{A^(-1)}. Note that 
   || r_k ||_A^(-1)^2 = r_k^T*A^(-1)*r_k 
                       = (b-A*x_k)^T*A^(-1)*(b-A*x_k)
                       = (A*x-A*x_k)^T*A^(-1)*(A*x-A*x_k)
                      = (x-x_k)^T*A*(x-x_k)
                      = || x - x_k ||_A^2
This algorithm is called the Conjugate Gradient Algorithm.

Thm: When A is spd, definitions (3) and (4) of "best" are equivalent, and
using the Conjugate Gradient algorithm, it is possible to compute x_k 
from previous iterates for the cost of one multiplication by A, and
a small number of dot products and saxpys (y = alpha*x + y), keeping
only 3 vectors in memory.

More generally, the choice of algorithm depends on the properties of A:
Decision Tree: Figure 6.8 (IterativeTreeAx=b.ps)
See also pointer on class web page to Templates for the solution of Linear Systems
to an expanded version of this tree, and pointers to software.

We now discuss GMRES, which uses the least structure of the matrix, 
and then CG, which uses the most.

In GMRES, at each step we choose x_k to minimize 
   || r_k ||_2 = || b - A*x_k ||_2
where x_k = Q_k*y_k, i.e. x_k is in {script K}_k(b,A). 
Thus we choose y_k to minimize
   || r_k ||_2 = || b - A*Q_k*y_k ||_2
               = || b - A*[Q_k,Q_u]*[ y_k; 0 ] ||_2
               = || Q^T*( b - A*[Q_k,Q_u]*[ y_k; 0 ] ) ||_2
                = || Q^T*b - H*[ y_k; 0 ] ||_2
               = || e_1 * ||b||_2 - [ H_k H_uk ] * [ y_k ] ||_2
                                    [ H_ku H_u ]   [  0  ]
                = || e_1 * ||b||_2 - [ H_k  ] * y_k ||_2
                                    [ H_ku ]   
Since only the first row of H_ku is nonzero, we just need to solve the 
k-by-(k+1) upper Hessenberg least squares problem
    || r_k ||_2 = || e_1 * ||b||_2 - [ H_k            ] * y_k ||_2
                  ||                 [ 0...0 H(k+1,k) ]
which can be done inexpensively with Givens rotations, exploiting the
upper Hessenberg structure, costing just O(k^2) instead of O(k^3).
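
Combining this with the Arnoldi sketch above gives a minimal unrestarted,
unpreconditioned GMRES sketch; for brevity it solves the small least squares
problem with backslash rather than with Givens rotations:
   function x = gmres_sketch(A, b, k)
     [Q,H] = arnoldi(A, b, k);            % from the sketch above
     e1 = zeros(k+1,1);  e1(1) = 1;
     y  = H \ (norm(b)*e1);               % (k+1) x k least squares problem
     x  = Q(:,1:k)*y;                     % x_k = Q_k*y_k
   end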

Now we return to CG, and begin by showing that when A is spd,
it is "best" in two senses:

Lemma: When A is spd, (3) and (4) are equivalent:
   (3) Choose x_k so that r_k^T * Q_k = 0.
   (4) Choose x_k to minimize || r_k ||_A^(-1) = || x_k - x ||_A
and are solved by x_k = Q_k*T_k^(-1)*e_1*||b||_2 , where T_k is
the tridiagonal matrix computed by Lanczos. 
Also r_k = +- ||r_k||_2*q_(k+1)

Proof: Drop the subscript k for simpler notation, so Q=Q_k, T=T_k,
x = Q*T^(-1)*e_1*||b||_2, r = b-A*x and T = Q^T*A*Q is nonsingular, 
by Question 6.1 (due today!). Then we need to confirm that
   Q^T * r = Q^T * (b - A*x)
           = Q^T*b - Q^T*A*x
            = e_1*||b||_2 - Q^T*A*Q*T^(-1)*e_1*||b||_2
            = e_1*||b||_2 - T*T^(-1)*e_1*||b||_2
           = 0  as desired
Now we need show that this choice of x minimizes || r ||_A^{-1}^2,
so consider a different x' = x + Q*z and r' = b - A*x' = r - A*Q*z
    || r' ||_A^(-1)^2 = r'^T * A^(-1) * r'
                     = (r - A*Q*z)^T*A^(-1)*(r - A*Q*z)
                     = r^T*A^(-1)*r - 2(A*Q*z)^T*A^(-1)*r + (A*Q*z)^T*A^(-1)*(A*Q*z)
                     = r^T*A^(-1)*r - 2*z^T*Q^T*A*A^(-1)*r + (A*Q*z)^T*A^(-1)*(A*Q*z)
                     = r^T*A^(-1)*r - 2*z^T*Q^T*r + (A*Q*z)^T*A^(-1)*(A*Q*z)
                     = r^T*A^(-1)*r               + (A*Q*z)^T*A^(-1)*(A*Q*z)
                            since Q^T*r = 0
                     = || r ||_A^(-1)^2           + || A*Q*z ||_A^(-1)^2
and this is minimized by choosing z = 0, so x' = x as desired.

Finally note that r_k = b - A*x_k must be in {script K}_(k+1), so
that r_k is a linear combination of columns of Q_{k+1} = [q_1,...,q_(k+1)].
But since r_k is perpendicular to the columns of Q_k = [q_1,...,q_k],
r_k must be parallel to q_(k+1), so r_k = +-||r_k||_2 * q_(k+1)


There are several ways to derive CG. We will take the most "direct" way
from the formula above, deriving recurrences for 3 sets of vectors,
of which we only need to keep the most recent ones: x_k, r_k, and
so-called conjugate gradient vectors p_k. 
(1) The p_k's are called gradients because each step of CG can be thought 
of as moving x_(k-1) along the gradient direction 
p_k (i.e. x_k = x_(k-1) + nu * p_k) until it minimizes || r_k ||_A^(-1)
over all choices of the scalar nu.
(2) The p_k's are called conjugate, or more precisely A-conjugate, because
they are orthogonal in the A inner product: p_k^T*A*p_j = 0 if j neq k.

CG is sometimes derived by figuring out recurrences for x_k, r_k and
p_k that satisfy properties (1) and (2), and then showing they satisfy
the optimality properties in the Lemma. A nice presentation is found
in Shewchuk's writeup on the class web page. Instead, we will
start with the formula for x_k = Q_k*T_k^(-1)*e_1*||b||_2 , from
the lemma, and derive the recurrences from there.

Since T_k = Q_k^T*A*Q_k is spd, we can do Cholesky to get 
T_k = L'_k*L'_k^T where L'_k is lower bidiagonal, and
L'_k*L'_k^T =  L_k*D_k*L_k^T where L_k has unit diagonal
and D_k is diagonal, with D_k(i,i) = L'_k(i,i)^2. Then
from the Lemma
   x_k = Q_k*T_k^(-1)*e_1*||b||_2
       = Q_k*(L_k*D_k*L_k^T)^(-1)*e_1*||b||_2
       = Q_k*L_k^(-T) * (D_k^(-1)*L_k^(-1)*e_1*||b||_2)
       = P'_k * y_k
where we write P'_k = [p'_1,...,p'_k]. The eventual conjugate gradients
p_k will turn out to be scalar multiples of the p'_k. So we
know enough to prove property (2) above:

Lemma: The p'_k's are A-conjugate, i.e. P'_k^T * A * P'_k is diagonal.
Proof: P'_k^T * A * P'_k = (Q_k*L_k^(-T))^T * A * (Q_k*L_k^(-T))
                       = L_k^(-1)* Q_k^T*A*Q_k * L_k^(-T)
                       = L_k^(-1)* T_k * L_k^(-T)
                       = L_k^(-1)* (L_k * D_k * L_k^T) * L_k^(-T)
                       = D_k

Now we derive simple recurrences for the column p'_k of P'_k, and the
components of y_k, which in turn give us a recurrence of x_k.
We will show that P'_(k-1) is identical to the leading k-1 columns of P'_k:
   P'_k = [P'_(k-1),p'_k]
and similarly for y_(k-1) = [s_1; ... ;s_(k-1)] and y_k = [s_1;...;s_(k-1);s_k].
Assuming these are true for a moment, they will give us the recurrence for x_k:
(Rx)   x_k = P'_k*y_k = [P'_(k-1),p'_k]*[s_1;...;s_k]
                      = P'_(k-1)*[s_1;...;s_(k-1)] + p'_k*s_k
                      = x_(k-1) + p'_k*s_k
assuming we can also get recurrences for p'_k and s_k.

Since Lanczos constructs T_k row by row so T_(k-1) is the 
leading k-1 by k-1 submatrix of T_k, and Cholesky also
computes row by row, we get that L_(k-1) and D_(k-1)
are the leading k-1 by k-1 submatrices of L_k and D_k, resp:
   T_k = L_k * D_k * L_k^T
       = [ L_(k-1)       0 ] * [ D_(k-1)  0 ] * [ L_(k-1)       0 ]^T
         [ 0...0 l_(k-1) 1 ]   [    0    d_k]   [ 0...0 l_(k-1) 1 ]
so  L_k^(-1) = [ L_(k-1)^(-1) 0 ]
               [    stuff     1 ]
and y_k = D_k^(-1)*L_k^(-1)*e_1*||b||_2
        = [ D_(k-1)^(-1)   0     ] * [ L_(k-1)^(-1) 0 ] * e_1 * ||b||_2
          [     0       d_k^(-1) ]   [    stuff     1 ]
        = [ D_(k-1)^(-1) * L_(k-1)^(-1) * e_1 * ||b||_2 ]
          [ s_k ]
        = [ y_(k-1) ]
          [ s_k     ]
as the desired recurrence for y_k. Similarly
   P'_k = Q_k * L_k^(-T)
        = [ Q_(k-1) , q_k ] * [ L_(k-1)^(-T) stuff ]
                              [     0          1   ]
        = [ Q_(k-1)*L_(k-1)^(-T) , p'_k ]
        = [ P'_(k-1) , p'_k ]
so to get a formula for p'_k, write 
   Q_k = P'_k * L_k^T,
and equating the last columns we get
   q_k = p'_k + p'_(k-1)*L_k(k,k-1)
or
(Rp)   p'_k = q_k - l_(k-1)*p'_(k-1)
as the desired recurrence for p'_k.

Finally we need a recurrence for r_k from (Rx):
(Rr)   r_k = b - A*x_k
           = b - A*(x_(k-1) + p'_k*s_k)
           = r_(k-1) - A*p'_k*s_k
Putting these vector recurrences together we get 

(Rr)  r_k  = r_(k-1) - A*p'_k*s_k
(Rx)  x_k  = x_(k-1) + p'_k*s_k
(Rp)  p'_k = q_k - l_(k-1)*p'_(k-1)

To get closer to the final recurrences, substitute
q_k = r_(k-1)/||r_(k-1)||_2 and p_k = ||r_(k-1)||_2 * p'_k to get

(Rr')  r_k  = r_(k-1) - A*p_k*(s_k/||r_(k-1)||_2)
            = r_(k-1) - A*p_k*(nu_k)
(Rx')  x_k  = x_(k-1) + p_k*nu_k
(Rp')  p_k = r_(k-1) - (||r_(k-1)||_2 * l_(k-1) / ||r_(k-2)||_2) * p_(k-1)
           = r_(k-1) + mu_k * p_(k-1)

We still need formulas for the scalars mu_k and nu_k. There are
several choices, some more numerically stable than others. 
See the text for the algebra, we just write here

    nu_k = r_(k-1)^T*r_(k-1) / p_k^T*A*p_k
    mu_k = r_(k-1)^T*r_(k-1) / r_(k-2)^T*r_(k-2)

Putting it all together, we get

Conjugate Gradient Method for solving Ax=b:
   k= 0 , x_0 = 0, r_0 = b, p_1 = b
   repeat
     k = k+1
     z = A*p_k
     nu_k = r_(k-1)^T*r_(k-1) / p_k^T*z
     x_k  = x_(k-1) + p_k*nu_k
     r_k  = r_(k-1) - nu_k*z
     mu_(k+1) = r_k^T*r_k / r_(k-1)^T*r_(k-1)
     p_(k+1)  = r_k + mu_(k+1) * p_k
   until || r_k ||_2 small enough
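
For reference, a runnable Matlab version of this loop (the stopping test and
argument list are ours):
   function x = cg_sketch(A, b, tol, maxit)
     x = zeros(size(b));  r = b;  p = b;  rho = r'*r;
     for k = 1:maxit
       z   = A*p;
       nu  = rho / (p'*z);
       x   = x + nu*p;
       r   = r - nu*z;
       rho_new = r'*r;
       if sqrt(rho_new) <= tol*norm(b), break, end
       mu  = rho_new / rho;
       p   = r + mu*p;
       rho = rho_new;
     end
   end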
     
Note that minimization property (1) above follows from the Lemma:
since x_k minimizes || r_k ||_A^(-1) over all candidates in the Krylov subspace
spanned by Q_k, it certainly also minimizes over the one-dimensional subset
x_(k-1) + nu*p_k, which is exactly property (1).