Notes for Ma221 Lecture 22, Nov 9 2010

Begin Chapter 6 on iterative methods

Goals:
  Contrast direct and iterative methods
    For Ax=b or least squares: use iterative methods when direct methods are too slow,
      use too much memory (from fill-in), or when we don't need as much accuracy
    For eigenproblems or the SVD: use iterative methods for the above reasons,
      or when only a few eigenvalues/vectors are desired
  A good iterative method exploits knowledge of the underlying physical or
    mathematical model that gives rise to the matrix
  Large diversity of methods, no one best
  Model problem: Poisson equation in 2D or 3D
    arises in electrostatics, heat flow, quantum mechanics,
    incompressible fluid mechanics, ...
  Large diversity of software tools available - see web page

Summary of iterative methods as applied to the Model problem
  From simple (slow) to sophisticated (fast):
    (1) simple ones apply to more problems than the Model problem
    (2) simple ones are building blocks for more sophisticated ones
  Methods covered:
    Jacobi
    Gauss-Seidel
    Successive Overrelaxation (SOR)
    Krylov Subspace Methods (KSMs): Conjugate Gradients, GMRES, ...
    preconditioning
    FFT
    Multigrid
    Domain decomposition (hybrids of the above)
    Time permitting, eigenvalue algorithms (Chap 7)
  Optimizing performance:
    Most methods assume the only thing you can afford to do is y = A*x
      (SpMV = Sparse-Matrix-Vector multiplication);
      KSMs only need a "black box" subroutine for y = A*x for the user's A
    Parallelizing SpMV
    Communication-avoiding methods reorganize the methods to take k steps for
      each access of A, instead of 1 step - possible class projects

Poisson's Equation in 1D with Dirichlet boundary conditions:
  -d^2 v(x)/dx^2 = f(x) on (0,1) with v(0) = v(1) = 0
Discretize with centered differences to get T_N:
  h = 1/(N+1), x_i = i*h, v_i = v(i*h), f_i = f(i*h)
  dv(x)/dx at x = (i-.5)*h ~ (v_i - v_(i-1))/h
  dv(x)/dx at x = (i+.5)*h ~ (v_(i+1) - v_i)/h
  d^2 v(x)/dx^2 at x = i*h ~ ((v_(i+1) - v_i)/h - (v_i - v_(i-1))/h)/h
                           = (v_(i+1) - 2*v_i + v_(i-1))/h^2
    with truncation error tau_i = O(h^2)
so the ODE becomes
  -v_(i+1) + 2*v_i - v_(i-1) = h^2*(f_i + tau_i) = h^2*f_i + O(h^4)  for i = 1 to N,
    with v_0 = v_(N+1) = 0
or
  T_N * [v_1;...;v_N] = T_N * v = h^2*[f_1;...;f_N] = h^2*f
where T_N is tridiagonal with diagonals = 2 and offdiagonals = -1.
This is often referred to as a "stencil" on a 1D mesh. (draw)

Eigenvalues and eigenvectors of T_N:
Lemma: T_N * z_j = lambda_j * z_j where
  lambda_j = 2*(1 - cos(pi*j/(N+1)))
  z_j(k) = sqrt(2/(N+1)) * sin(j*k*pi/(N+1))   and   ||z_j||_2 = 1
Proof: Trigonometry (homework!)
Corollary: The matrix Z with Z_jk = sqrt(2/(N+1))*sin(j*k*pi/(N+1)) is orthogonal
  (related to the FFT matrix later)
(Fig 6.1 in text (graph711.eps) for eigenvalues, Fig 6.2 in text (graph712.eps) for eigenvectors)
The eigenvalues range from the smallest, lambda_j ~ (pi*j/(N+1))^2 for small j, to the
largest, lambda_N ~ 4. This tells us the condition number of T_N,
  lambda_N/lambda_1 ~ (2*(N+1)/pi)^2 = (2/pi)^2 * h^(-2)
and lets us bound the error in the solution of the discrete 1D Poisson equation:
  || T_N^(-1)*h^2*tau || = O((N+1)^2*h^4) = O(h^2)

Let us compare the eigenvalues and eigenvectors of T_N with the eigenvalues and
eigenfunctions of the differential equation:
  -d^2 v'(x)/dx^2 = lambda' * v'(x)   with v'(0) = v'(1) = 0
We see that v'_j(x) must be a linear combination of sin(sqrt(lambda')*x) and
cos(sqrt(lambda')*x). v'(0) = 0 means it is a multiple just of sin(sqrt(lambda')*x),
and v'(1) = 0 means sqrt(lambda') = j*pi, or lambda'_j = (j*pi)^2, and v'_j(x) is a
multiple (say 1) of sin(j*pi*x).
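As a quick numerical check (a minimal MATLAB sketch; the value of N is arbitrary), we
can verify the eigenvalue formula in the Lemma and see that h^(-2)*lambda_j approaches
(j*pi)^2 for small j:

  % Check the eigenvalue formula for T_N and compare with the ODE eigenvalues
  N = 50;  h = 1/(N+1);
  TN = 2*eye(N) - diag(ones(N-1,1),-1) - diag(ones(N-1,1),+1);
  j = (1:N)';
  lambda_formula = 2*(1 - cos(pi*j/(N+1)));        % formula from the Lemma
  max(abs(sort(eig(TN)) - lambda_formula))         % should be ~ machine epsilon
  [ lambda_formula(1:5)/h^2 , (pi*j(1:5)).^2 ]     % h^(-2)*lambda_j vs (j*pi)^2 for small j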
Thus the eigenvectors z_j of T_N are exactly the (scaled) eigenfunctions v'_j of the
ODE sampled at the mesh points x_k = k*h, and when j is small, the eigenvalues
lambda'_j = (j*pi)^2 are well-approximated by h^(-2)*lambda_j.

2D with Dirichlet boundary conditions:
  -d^2 v(x,y)/dx^2 - d^2 v(x,y)/dy^2 = f(x,y) on the square (0,1)^2,
    with v(x,y) = 0 on the boundary of the square
Discretizing as before we get v_ij ~ v(i*h,j*h) etc., and
  d^2 v(x,y)/dx^2 ~ ( v_i-1,j - 2*v_i,j + v_i+1,j ) / h^2
  d^2 v(x,y)/dy^2 ~ ( v_i,j-1 - 2*v_i,j + v_i,j+1 ) / h^2
and adding these and multiplying by -h^2:
  4*v_ij - v_i-1,j - v_i+1,j - v_i,j-1 - v_i,j+1 = h^2*f_ij
This is often referred to as a "stencil" on a 2D mesh. (draw)
Letting V be the NxN matrix of the v_ij, we can also write
  2*v_ij - v_i-1,j - v_i+1,j = (T_N*V)_ij
  2*v_ij - v_i,j-1 - v_i,j+1 = (V*T_N)_ij
or
  4*v_ij - v_i-1,j - v_i+1,j - v_i,j-1 - v_i,j+1 = (T_N*V + V*T_N)_ij
and
  T_N*V + V*T_N = h^2*F
a system of N^2 linear equations for the N^2 entries of V, though not in the usual
"A*x=b" form.

We can ask about the corresponding eigenvalues and eigenvectors in the analogous form
  T_N*V + V*T_N = lambda*V
Suppose that T_N*z_i = lambda_i*z_i and T_N*z_j = lambda_j*z_j are any two eigenpairs
of T_N. Letting V = z_i * z_j^T, we get
  T_N*V + V*T_N = T_N*z_i*z_j^T + z_i*z_j^T*T_N
                = (T_N*z_i)*z_j^T + z_i*(z_j^T*T_N)
                = (lambda_i*z_i)*z_j^T + z_i*(z_j^T*lambda_j)
                = (lambda_i + lambda_j)*(z_i*z_j^T)
                = (lambda_i + lambda_j)*V
so that V is an "eigenvector". Since there are N^2 unknowns, we expect N^2 eigenpairs,
for all possible sums of pairs lambda_i + lambda_j. Comparing with the eigenvalues and
eigenfunctions of the 2D PDE, we see that (homework!)
  v_ij(x,y) = sin(i*pi*x) * sin(j*pi*y)
is an eigenfunction with eigenvalue i^2*pi^2 + j^2*pi^2 = (i^2+j^2)*pi^2, so as in the
discrete case, we just multiply all pairs of eigenfunctions component-by-component and
add the corresponding eigenvalues.
(Fig 6.3 in text (2DPoissonEV1.eps, 2DPoissonEV2.eps) for eigenvectors and contour plots)

We want to write V as a single vector in a way that easily extends to higher
dimensional Poisson equations. In the 3x3 case, write V columnwise from left to right as
  v = [ v_11; v_21; v_31; v_12; v_22; v_32; v_13; v_23; v_33 ]
and T_NxN as
  [  4 -1  0  -1  0  0   0  0  0 ]
  [ -1  4 -1   0 -1  0   0  0  0 ]
  [  0 -1  4   0  0 -1   0  0  0 ]
  [ -1  0  0   4 -1  0  -1  0  0 ]
  [  0 -1  0  -1  4 -1   0 -1  0 ]
  [  0  0 -1   0 -1  4   0  0 -1 ]
  [  0  0  0  -1  0  0   4 -1  0 ]
  [  0  0  0   0 -1  0  -1  4 -1 ]
  [  0  0  0   0  0 -1   0 -1  4 ]
The pattern is that in each row,
  the 4 on the diagonal multiplies v_ij,
  the -1s next to the diagonal multiply v_i-1,j and v_i+1,j,
  the -1s N away from the diagonal multiply v_i,j-1 and v_i,j+1.
Note that some rows have fewer than four -1s; these correspond to mesh points on the
boundary of the square. One may confirm that T_NxN * v = h^2*f.
Another way to write T_NxN is
  T_NxN = [ T_N + 2*I_N   -I_N                        ]
          [ -I_N           T_N + 2*I_N   -I_N         ]
          [                    ...                    ]
          [                -I_N           T_N + 2*I_N ]
a pattern we will generalize to arbitrary dimensions, using Kronecker products.

Poisson's equation in d dimensions:
We need to convert arrays most naturally indexed by 2 or more indices (like V above)
into vectors.
Def: Let X be mxn; then vec(X) is the m*n x 1 vector gotten by stacking the columns of
  X atop one another, from left to right. (In Matlab, vec(X) = reshape(X,m*n,1).)
Def: Let A be mxn and B be pxq; then A kron B is the m*p x n*q matrix
  [ A_11*B  A_12*B ... A_1n*B ]
  [ A_21*B  A_22*B ... A_2n*B ]
  [           ...             ]
  [ A_m1*B  A_m2*B ... A_mn*B ]
(In Matlab, A kron B = kron(A,B).)
Lemma: Let A be mxm, B be nxn, and X and C be mxn. Then
  1. vec(A*X) = (I_n kron A) * vec(X)
  2. vec(X*B) = (B^T kron I_m) * vec(X)
  3. The 2D Poisson equation T_N*V + V*T_N = h^2*F may be written
       (I_N kron T_N + T_N kron I_N) * vec(V) = h^2*vec(F)
Proof of 1: I_n kron A = diag(A,A,...,A) (n times), so
  (I_n kron A) * vec(X) = diag(A,A,...,A)*[X(:,1); X(:,2); ... ; X(:,n)]
                        = [ A*X(:,1); A*X(:,2); ... ; A*X(:,n) ]
                        = vec(A*X)
Proof of 2: similar (homework!)
Proof of 3: Apply part 1 to T_N*V and part 2 to V*T_N, noting that T_N is symmetric.
Note that
  I_N kron T_N + T_N kron I_N = [ T_N             ]   [ 2*I_N  -I_N               ]
                                [      T_N        ] + [ -I_N   2*I_N  -I_N        ]
                                [          ...    ]   [            ...            ]
                                [             T_N ]   [              -I_N   2*I_N ]
In Matlab, this is two lines:
  TN = 2*eye(N) - diag(ones(N-1,1),-1) - diag(ones(N-1,1),+1)
  TNN = kron(eye(N),TN) + kron(TN,eye(N))

Lemma (homework!):
  (1) Assume that A*C and B*D are well defined. Then (A kron B)*(C kron D) = (A*C) kron (B*D)
  (2) If A and B are invertible, then (A kron B)^(-1) = A^(-1) kron B^(-1)
  (3) (A kron B)^T = A^T kron B^T

Prop: Let T = Z*Lambda*Z^T be the eigendecomposition of the NxN symmetric matrix T.
Then the eigendecomposition of I kron T + T kron I is
  (Z kron Z) * (I kron Lambda + Lambda kron I) * (Z^T kron Z^T)
Here I kron Lambda + Lambda kron I is a diagonal matrix whose ((i-1)*N+j)th entry is
lambda_i + lambda_j, and Z kron Z is orthogonal with ((i-1)*N+j)th column z_i kron z_j.
Proof: Multiply out the factorization using the last lemma:
  (Z kron Z) * (I kron Lambda + Lambda kron I) * (Z^T kron Z^T)
    = (Z*I kron Z*Lambda + Z*Lambda kron Z*I) * (Z^T kron Z^T)
    = (Z*I*Z^T kron Z*Lambda*Z^T + Z*Lambda*Z^T kron Z*I*Z^T)
    = I kron T + T kron I
as desired.

Similarly, Poisson's equation in 3D leads to the matrix
  T_NxNxN = (T_N kron I_N kron I_N) + (I_N kron T_N kron I_N) + (I_N kron I_N kron T_N)
with eigenvalue matrix
  (Lambda_N kron I_N kron I_N) + (I_N kron Lambda_N kron I_N) + (I_N kron I_N kron Lambda_N)
i.e. N^3 eigenvalues lambda_i + lambda_j + lambda_k for all triples (i,j,k), and
corresponding eigenvector matrix Z kron Z kron Z. Poisson's equation in d dimensions is
similarly represented.
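To make the 3D construction concrete, here is a minimal MATLAB sketch (small N and
dense matrices, just for checking) that builds T_NxNxN with kron and verifies that its
eigenvalues are the sums lambda_i + lambda_j + lambda_k:

  N = 4;
  TN = 2*eye(N) - diag(ones(N-1,1),-1) - diag(ones(N-1,1),+1);
  I = eye(N);
  TNNN = kron(TN,kron(I,I)) + kron(I,kron(TN,I)) + kron(I,kron(I,TN));
  lambda = 2*(1 - cos(pi*(1:N)/(N+1)));           % eigenvalues of T_N
  [li,lj,lk] = ndgrid(lambda,lambda,lambda);      % all N^3 triples
  sums = sort(li(:) + lj(:) + lk(:));
  max(abs(sort(eig(TNNN)) - sums))                % should be ~ machine epsilon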
Summary of performance of iterative methods as applied to the discrete Poisson equation
in 2D (and 3D). "Parallel Steps" is on a machine with as many processors as you need
(#Procs), and so refers only to the available parallelism, not how you would actually
implement it on a real machine with a smaller number of processors.
Some explanations and abbreviations:
  Explicit Inverse assumes we have precomputed the exact inverse of the matrix (and are
    not counting the cost of doing this!) and only need to multiply by it
  SOR = Successive Overrelaxation
  SSOR/Chebyshev = Symmetric SOR with Chebyshev Acceleration
  FFT = Fast Fourier Transform
  BCR = Block Cyclic Reduction
  Lower Bound is a lower bound that assumes one flop per component of the output vector
For 2D Poisson on an nxn mesh, we let N = n^2 = #unknowns.
For 3D Poisson on an nxnxn mesh, we let N = n^3 = #unknowns.
If the table entry for 3D Poisson differs from 2D, it is shown in parentheses.
All table entries are meant in a Big-Oh sense.
#Parallel Steps is the fewest possible, given the largest useful #Procs (so #Procs is
much larger than what you would have in practice, and is just a theoretical limit on
how much parallelism there is in the problem).

Method               Direct or   #Flops      Mem        #Parallel       #Procs
                     Iterative                          Steps
-------------------------------------------------------------------------
Dense Cholesky       D           N^3         N^2        N               N^2
   Works on any spd matrix
-------------------------------------------------------------------------
Explicit Inverse     D           N^2         N^2        log N           N^2
-------------------------------------------------------------------------
Band Cholesky        D           N^2         N^(3/2)    N               N
                                (N^(7/3))   (N^(5/3))                  (N^(4/3))
   Works on any spd band matrix with cost O(N*bw^2),
     2D: bw = n = N^(1/2),  3D: bw = n^2 = N^(2/3)
-------------------------------------------------------------------------
Jacobi               I           N^2         N          N               N
                                (N^(5/3))              (N^(2/3))
   3D case: Homework!
   #Flops = O(#Flops(SpMV) * #iterations), where #Flops(SpMV) = O(#nonzeros in A)
     and #iterations ~ cond(A) = n^2 in 2D and 3D
   Works on any diagonally dominant matrix; convergence rate depends on cond(A)
-------------------------------------------------------------------------
Gauss-Seidel         I           N^2         N          N               N
                                (N^(5/3))              (N^(2/3))
   3D case: Homework!
   Cost analogous to Jacobi (constants differ)
   Works on any diagonally dominant or spd matrix
-------------------------------------------------------------------------
Sparse Cholesky      D           N^(3/2)     N*log N    N^(1/2)         N
                                (N^2)       (N^(4/3))  (N^(2/3))       (N^(4/3))
   Assumes we are using the best ordering known (nested dissection)
   Works on any spd matrix; complexity arises from the cost of dense Cholesky on the
     trailing nxn submatrix (in 2D) or n^2 x n^2 submatrix (in 3D)
-------------------------------------------------------------------------
Conjugate Gradients  I           N^(3/2)     N          N^(1/2)*log N   N
                                (N^(4/3))              (N^(1/3)*log N)
   3D case: Homework!
   #Flops = O(#Flops(SpMV) * #iterations), where #Flops(SpMV) = O(#nonzeros in A)
     and #iterations = O(sqrt(cond(A)))
   Works on any spd matrix
-------------------------------------------------------------------------
SOR                  I           N^(3/2)     N          N^(1/2)         N
                                (N^(4/3))              (N^(1/3))
   3D case: Homework!
   SOR = Successive Overrelaxation
   Works on any spd matrix; convergence rate depends on cond(A)
-------------------------------------------------------------------------
SSOR/Chebyshev       I           N^(5/4)     N          N^(1/4)         N
                                (N^(7/6))              (N^(1/6))
   3D case: Homework!
   #Flops = O(#Flops(SpMV) * #iterations), where #Flops(SpMV) = O(#nonzeros in A)
     and #iterations = (cond(A))^(1/4)
   Works on any spd matrix, but need to know the range of eigenvalues
-------------------------------------------------------------------------
FFT                  D           N*log N     N          log N           N
   Works on the model problem
-------------------------------------------------------------------------
BCR                  D           N*log N     N          ?               ?
   BCR = Block Cyclic Reduction
   3D case: Homework!
   Works on problems slightly more general than the model problem
-------------------------------------------------------------------------
Multigrid            I           N           N          log^2 N         N
   Many variants of this algorithm, which work quite generally on problems arising
     from elliptic PDEs and FEM models
-------------------------------------------------------------------------
Lower bound                      N           N          log N
-------------------------------------------------------------------------

Basic iterative methods: given an initial guess x_0 for a solution of A*x=b, cheaply
compute a sequence x_i converging to A^(-1)*b.
Def: A splitting of A is a decomposition A = M - K with M nonsingular.
This yields the following iterative method: write A*x = M*x - K*x = b, so M*x = K*x + b.
Then compute x_(i+1) from x_i by solving M*x_(i+1) = K*x_i + b, or
x_(i+1) = M^(-1)*K*x_i + M^(-1)*b = R*x_i + c:
  (*)  x_(i+1) = R*x_i + c
For this to work well, we need two conditions:
  (1) it should converge
  (2) multiplying by R (i.e.
solving M*x_(i+1) = K*x_i + b for x_(i+1)) and computing c = M^(-1)*b should be much
cheaper than solving with A itself.

Lemma: Let ||.|| be any operator norm. Then if ||R|| < 1, (*) converges to A^(-1)*b for
any x_0.
Proof: Subtract x = R*x + c from x_(i+1) = R*x_i + c to get x_(i+1) - x = R*(x_i - x), or
  ||x_(i+1) - x|| <= ||R||*||x_i - x|| <= ... <= ||R||^(i+1)*||x_0 - x||
which converges to zero for any x_0 if ||R|| < 1.

Def: The spectral radius of R is rho(R) = max |lambda|, the largest absolute value of
any eigenvalue of R.

Lemma: For all operator norms, rho(R) <= ||R||. For all R and all eps > 0, there exists
an operator norm ||.||_* such that ||R||_* <= rho(R) + eps.
Proof: To show rho(R) <= ||R||, note that ||R|| = max_x ||R*x||/||x||. By picking x to
be an eigenvector, we see that ||R|| >= |lambda| for any eigenvalue lambda of R.
To construct ||.||_*, we use the Jordan form of R: let S^(-1)*R*S = J be in Jordan
form, i.e. J is block diagonal with Jordan blocks. Let D = diag(1,eps,eps^2,...,eps^(n-1)).
Then J_eps = D^(-1)*J*D leaves the diagonal (eigenvalues) unchanged, and multiplies
each superdiagonal entry of J by eps, so J_eps has the eigenvalues on the diagonal, and
eps above the diagonal in any Jordan blocks. We now define a new vector norm by
||x||_* = ||(S*D)^(-1)*x||_inf. (See Question 1.5 for the proof that this defines a
vector norm.) Then
  ||R||_* = max_x ||R*x||_* / ||x||_*
          = max_x ||(S*D)^(-1)*R*x||_inf / ||(S*D)^(-1)*x||_inf
          = max_y ||(S*D)^(-1)*R*(S*D)*y||_inf / ||y||_inf    where y = (S*D)^(-1)*x
          = max_y ||D^(-1)*S^(-1)*R*S*D*y||_inf / ||y||_inf
          = max_y ||J_eps*y||_inf / ||y||_inf
          <= max |lambda| + eps = rho(R) + eps

Theorem: The iteration x_(i+1) = R*x_i + c converges to the solution of A*x=b for all
starting vectors x_0 if and only if rho(R) < 1.
Proof: If rho(R) >= 1, choose x_0 so that x_0 - x is an eigenvector of R for its
largest eigenvalue, call it lambda. Then x_i - x = R^i*(x_0 - x) = lambda^i*(x_0 - x)
does not converge to zero since |lambda| >= 1. If rho(R) < 1, use the last lemma to
conclude that there is an operator norm ||.||_* and an eps such that
||R||_* <= rho(R) + eps < 1, so that x_i - x = R^i*(x_0 - x) converges to zero for all x_0.

Obviously, the smaller rho(R) is, the faster the convergence. Our goal is to pick the
splitting A = M - K so that multiplying by R = inv(M)*K is easy, and rho(R) is small.
Choosing M = I and K = I - A makes multiplying by R easy, but does not generally make
rho(R) small. Choosing K = 0, and so R = 0, makes convergence immediate, but requires
having the solution c = A^(-1)*b.

Now we can describe Jacobi, Gauss-Seidel (GS) and Successive Overrelaxation (SOR). All
share the notation A = D - L' - U' = D*(I - L - U), where D is the diagonal of A, L' is
the negative of the strictly lower triangular part, and U' is the negative of the
strictly upper triangular part of A.

Jacobi:
  In words: for j = 1 to n, pick x_(i+1)(j) to exactly satisfy equation j.
  As a loop:
    for j = 1 to n
      x_(i+1)(j) = (b_j - sum_{k neq j} A(j,k)*x_i(k)) / A(j,j)
  As a splitting: A = D - (L'+U') = M - K, so R_J = M^(-1)*K = D^(-1)*(L'+U') = L+U
  For the 2D Poisson equation T_N*V + V*T_N = h^2*F, to get from V_i to V_(i+1):
    for j = 1 to n, for k = 1 to n
      V_(i+1)(j,k) = (V_i(j-1,k) + V_i(j+1,k) + V_i(j,k-1) + V_i(j,k+1) + h^2*F(j,k))/4
                   = "average" of the 4 nearest neighbors and the right-hand side

Gauss-Seidel:
  In words: improve on Jacobi by using the most recently updated values of the solution.
  As a loop:
    for j = 1 to n
      x_(i+1)(j) = ( b_j - sum_{k < j} A(j,k)*x_(i+1)(k)       ...
                                 updated x components
                           - sum_{k > j} A(j,k)*x_i(k) ) / A(j,j)   ... older x components
  As a splitting: A = (D-L') - U', so
    R_GS = (D-L')^(-1)*U' = (D*(I - D^(-1)*L'))^(-1)*U' = (I-L)^(-1)*U
Note that the order in which we update entries matters in GS (unlike Jacobi). For 2D
Poisson there are two orders to consider:
  natural order (rowwise or columnwise update of V(i,j))
  Red-Black (RB) ordering: color the entries in the 2D mesh like a checkerboard, and
    number all the Reds before all the Blacks. Note that each Red node only has Black
    neighbors, and vice-versa. Thus when updating Red nodes, we can update them in any
    order, since all their Black neighbors have old data. Then when we update the Black
    nodes, we can update them in any order, because all their Red neighbors have new data.
    For all (j,k) nodes that are Red (j+k even):
      V_(i+1)(j,k) = (V_i(j-1,k) + V_i(j+1,k) + V_i(j,k-1) + V_i(j,k+1) + h^2*F(j,k))/4
        ... only use old data
    For all (j,k) nodes that are Black (j+k odd):
      V_(i+1)(j,k) = (V_(i+1)(j-1,k) + V_(i+1)(j+1,k) + V_(i+1)(j,k-1) + V_(i+1)(j,k+1)
                      + h^2*F(j,k))/4
        ... only use new data

SOR:
  In words: this depends on a parameter w. We let the result of SOR be a weighted
  combination of the old (x_i) and new (x_(i+1)) solutions that GS would have computed:
    x^SOR(w)_(i+1)(j) = (1-w)*x^GS_i(j) + w*x^GS_(i+1)(j)
  When w=1, SOR(1) is just GS. When w<1 we would call this underrelaxation, and when
  w>1, we call it overrelaxation. We prefer to "overrelax", with the intuition that if
  moving in the direction from x_i to x_(i+1) was a good idea, moving even farther in
  the same direction is better. Later we will show how to pick w optimally for the
  model problem.
  As a loop:
    for j = 1 to n
      x_(i+1)(j) = (1-w)*x_i(j)
                   + w*( b_j - sum_{k < j} A(j,k)*x_(i+1)(k)          ... updated x components
                             - sum_{k > j} A(j,k)*x_i(k) ) / A(j,j)   ... older x components
  As a splitting: multiply through by D to get
    (D - w*L')*x_(i+1) = ((1-w)*D + w*U')*x_i + w*b
  and then divide by w to get the splitting A = (D/w - L') - (D/w - D + U'), or
    R_SOR(w) = (D/w - L')^(-1) * (D/w - D + U') = (I - w*L)^(-1)*((1-w)*I + w*U)
  For 2D Poisson with Red-Black ordering:
    For all (j,k) nodes that are Red:
      V_(i+1)(j,k) = (1-w)*V_i(j,k)
                     + w*(V_i(j-1,k) + V_i(j+1,k) + V_i(j,k-1) + V_i(j,k+1) + h^2*F(j,k))/4
        ... only use old data
    For all (j,k) nodes that are Black:
      V_(i+1)(j,k) = (1-w)*V_i(j,k)
                     + w*(V_(i+1)(j-1,k) + V_(i+1)(j+1,k) + V_(i+1)(j,k-1) + V_(i+1)(j,k+1)
                          + h^2*F(j,k))/4
        ... use new data

Now we analyze the convergence of these basic methods, in general and for the Model
Problem (2D Poisson) in particular. Jacobi's method for the model problem is easy to
analyze, since in the splitting T_nxn = M - K = 4*I - (4*I - T_nxn) we get
R = M^(-1)*K = I - T_nxn/4, so the eigenvalues of R are 1 - (lambda_i + lambda_j)/4 for
all pairs of eigenvalues lambda_i = 2*(1 - cos(i*pi/(n+1))) and lambda_j of T_n, and so
the spectral radius of R is
  rho(R) = 1 - lambda_min/2 = 1 - (1 - cos(pi/(n+1))) = cos(pi/(n+1)) ~ 1 - pi^2/(2*(n+1)^2)
which gets closer to 1 as n grows, i.e. Jacobi converges more slowly. Since the error
after m steps is multiplied by rho(R)^m, we estimate the speed of convergence by
computing how many steps m are needed to reduce the error by a constant factor. Setting
rho(R) = 1 - x and solving (1-x)^m = exp(-1) for simplicity, we get
  (1-x)^m = (1-x)^[(1/x)*m*x] ~ exp(-m*x) = exp(-1)
or m = 1/x = 2*(n+1)^2/pi^2 = O(n^2) = O(N) steps, where N = n^2 is the number of
unknowns. Note that cond(T_nxn) ~ 4*N/pi^2 is also O(N).
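As a concrete illustration (a minimal MATLAB sketch; the mesh size, right-hand side,
and number of sweeps are arbitrary choices), here is the Jacobi sweep for the 2D model
problem written as the averaging update above, together with a check of the spectral
radius formula rho(R_J) = cos(pi/(n+1)):

  n = 31;  h = 1/(n+1);
  F = ones(n,n);                        % arbitrary right-hand side
  Vold = zeros(n+2,n+2);  Vnew = Vold;  % pad with a border of zeros for the boundary
  for sweep = 1:100                     % a fixed number of Jacobi sweeps
    for j = 2:n+1
      for k = 2:n+1
        Vnew(j,k) = ( Vold(j-1,k) + Vold(j+1,k) + Vold(j,k-1) + Vold(j,k+1) ...
                      + h^2*F(j-1,k-1) ) / 4;
      end
    end
    Vold = Vnew;
  end
  % check the spectral radius of R_J = I - T_nxn/4 against cos(pi/(n+1))
  TN  = 2*eye(n) - diag(ones(n-1,1),-1) - diag(ones(n-1,1),+1);
  TNN = kron(eye(n),TN) + kron(TN,eye(n));
  [ max(abs(eig(eye(n^2) - TNN/4))) , cos(pi/(n+1)) ]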
So the number of iterations grows proportionally to the dimension N, and to the
condition number. So to reduce the error by any constant factor costs O(N) iterations
times O(N) flops per iteration, or O(N^2) overall, explaining the entry in the table
above. This is typical of many iterative methods: the number of iterations grows with
the condition number (multigrid is an important exception!).

Later we will show that, provided the variables are updated in the appropriate order,
rho(R_GS) = rho(R_J)^2, i.e. one step of Gauss-Seidel reduces the error as much as two
Jacobi steps; this is better, but the overall complexity O(N^2) is the same. Then,
again with the right update order and the right choice of overrelaxation parameter w,
SOR(w) is much faster, with rho(R_SOR(w_opt)) ~ 1 - 2*pi/n for large n, which is close
to 1 but much farther away than rho(R_J), and takes only O(n) = O(sqrt(N)) steps to
reduce the error by a constant factor, for an overall cost of O(N^(3/2)) as opposed to
O(N^2).

Now we present (and prove some of!) the general theory of these three methods applied
to Ax=b.

Thm 1: If A is strictly row diagonally dominant:
  |A_ii| > sum_{j neq i} |A_ij|
then both Jacobi and GS converge, and GS is at least as fast as Jacobi in the sense
that ||R_GS||_inf <= ||R_J||_inf < 1.

The model problem is not strictly row diagonally dominant, since the diagonal equals
the negative sum of the off-diagonals in most rows. We call a matrix where
|A_ii| >= sum_{j neq i} |A_ij| in all rows, with strict inequality at least once,
weakly row diagonally dominant. This alone is not enough for convergence: consider
  A = [ 1 -1 0; 1 1 0; 0 0 1 ]  and  R = [ 0 1 0; -1 0 0; 0 0 0 ],
whose powers do not converge. So we need one more condition:
Def: A matrix A is irreducible if there is no permutation P such that P*A*P^T is block
triangular. Equivalently, the graph corresponding to the entries of A is "strongly
connected", i.e. there is a path from every vertex to every other vertex.
The model problem satisfies these conditions: the mesh is strongly connected, so the
matrix is irreducible, and |A_ii| >= sum_{j neq i} |A_ij| in every row, with strict
inequality |A_ii| > sum_{j neq i} |A_ij| in the rows corresponding to vertices next to
a boundary of the mesh.

Thm 2: If A is weakly row diagonally dominant and irreducible (like the model problem),
then rho(R_GS) < rho(R_J) < 1, so both Jacobi and GS converge, and GS is faster.

Now we come to SOR(w).
Thm 3: If A is spd, then SOR(w) converges if and only if 0 < w < 2. In particular
SOR(1) = GS converges.

To analyze the convergence of SOR(w) on the model problem we need another
graph-theoretic property of the matrix:
Def: A matrix has Property A if there is a permutation P such that
  P*A*P^T = [ A11 A12 ; A21 A22 ]
has the property that A11 and A22 are diagonal. In graph-theoretic terms, the vertices
V of the graph can be broken into two disjoint sets V = V1 U V2, where there are no
edges between pairs of vertices in V1, or between pairs of vertices in V2. Such a graph
is called bipartite.
Ex: For the model problem on a 1D mesh, let the odd numbered nodes be V1 and the even
numbered nodes be V2; since the only edges connect even to odd, the matrix has Property A.
Ex: For the model problem on a 2D mesh, think of the vertices as forming a
checkerboard, so they are either Red or Black. Then the only edges connect Red to
Black, so the matrix has Property A. Another way to say this is to put vertices v_ij
with i+j odd in V1 and i+j even in V2. This works for the 3D model problem and in
higher dimensions.
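To see Property A concretely for the 2D model problem (a minimal MATLAB sketch; the
value of n is arbitrary), we can build T_nxn, permute the unknowns so all Red nodes
(j+k even) come before all Black nodes (j+k odd), and check that the two diagonal
blocks of the permuted matrix are diagonal:

  n = 8;
  TN  = 2*eye(n) - diag(ones(n-1,1),-1) - diag(ones(n-1,1),+1);
  TNN = kron(eye(n),TN) + kron(TN,eye(n));
  [J,K] = ndgrid(1:n,1:n);                  % mesh coordinates (j,k) of unknown v_jk in vec order
  red   = find(mod(J(:)+K(:),2) == 0);      % Red nodes: j+k even
  black = find(mod(J(:)+K(:),2) == 1);      % Black nodes: j+k odd
  p = [red; black];                         % red-black ordering
  A = TNN(p,p);                             % P*T_nxn*P^T
  nr = length(red);
  A11 = A(1:nr,1:nr);  A22 = A(nr+1:end,nr+1:end);
  [ norm(A11 - diag(diag(A11)),1) , norm(A22 - diag(diag(A22)),1) ]   % both should be 0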
Thm 4: Suppose the matrix A has Property A, and we do SOR(w) updating all the vertices
in V1 before all the vertices in V2. Then the eigenvalues mu of R_J and lambda of
R_SOR(w) are related by
  (*)  (lambda + w - 1)^2 = lambda*w^2*mu^2
In particular, if w=1, then lambda = mu^2, so rho(R_SOR(1)) = rho(R_GS) = rho(R_J)^2,
and GS converges twice as fast as Jacobi. More generally, since we know all the
eigenvalues mu of R_J for the model problem, we can use (*) to figure out all the
eigenvalues lambda of R_SOR(w), and then pick w_opt to minimize rho(R_SOR(w_opt)).
This yields

Thm 5: Suppose A has Property A, we do SOR(w) updating vertices in V1 before V2 as
before, all eigenvalues of R_J are real, and mu = rho(R_J) < 1 (as in the model
problem). Then
  w_opt = 2/(1 + sqrt(1 - mu^2))
  rho(R_SOR(w_opt)) = w_opt - 1 = mu^2 / (1 + sqrt(1 - mu^2))^2
In particular, for the 2D model problem on an n x n grid,
  w_opt = 2/(1 + sin(pi/(n+1))) ~ 2
  rho(R_SOR(w_opt)) = cos^2(pi/(n+1))/(1 + sin(pi/(n+1)))^2 ~ 1 - 2*pi/(n+1)

Now we begin to prove (some of) these results.

Thm 1: If A is strictly row diagonally dominant:
  |A_ii| > sum_{j neq i} |A_ij|
then both Jacobi and GS converge, and GS is at least as fast as Jacobi in the sense
that ||R_GS||_inf <= ||R_J||_inf < 1.
Proof: Split A = D - (D - A), so R_J = D^(-1)*(D - A) = I - D^(-1)*A. So
  sum_i |R_J(j,i)| = |1 - A(j,j)/A(j,j)| + sum_{i neq j} |A(j,i)|/|A(j,j)|
                   = ( sum_{i neq j} |A(j,i)| ) / |A(j,j)|
                   < 1  by strict row diagonal dominance,
so ||R_J||_inf < 1 and Jacobi converges.
Showing ||R_GS||_inf <= ||R_J||_inf is more involved. Write A = D - L' - U' =
diagonal + lower triangle + upper triangle as before, so
  R_J = D^(-1)*(L' + U') = L + U
and
  R_GS = (D-L')^(-1)*U' = (D*(I - D^(-1)*L'))^(-1)*U' = (I - D^(-1)*L')^(-1)*(D^(-1)*U')
       = (I - L)^(-1)*U
We want to show ||R_GS||_inf <= ||R_J||_inf, or equivalently
  || |R_GS|*e ||_inf <= || |R_J|*e ||_inf    where e = [1;1;...;1].
This will be implied by the stronger inequality |R_GS|*e <= |R_J|*e, or
  (*)  |(I-L)^(-1)*U|*e <= |L+U|*e = (|L|+|U|)*e
Now (I-L)^(-1) = sum_{i=0 to n-1} L^i (note that higher powers of L are zero); this is
a geometric sum, with the same proof as in the scalar case. So
  |(I-L)^(-1)*U| = | sum_i L^i*U | <= sum_i |L|^i*|U|   ... triangle inequality
                 = (I - |L|)^(-1)*|U|
and (*) would be implied by the stronger inequality
  (**)  (I - |L|)^(-1)*|U|*e <= (|L|+|U|)*e
This in turn would be implied by multiplying the following inequality by the
nonnegative matrix (I - |L|)^(-1):
  (***)  |U|*e <= (I - |L|)*(|L|+|U|)*e
Multiply out (***) and cancel |U|*e to get
  0 <= (|L| - |L|^2 - |L|*|U|)*e = |L|*(I - |L| - |U|)*e = |L|*(I - |R_J|)*e
This is (finally!) in turn implied by multiplying |L| times
  0 <= (I - |R_J|)*e,  i.e.  |R_J|*e <= I*e = e,  i.e.  ||R_J||_inf <= 1,
which was our starting assumption!

We will not prove Thm 2 (for a proof see the book by Varga referenced in the text).

Thm 3: If A is spd, then SOR(w) converges if and only if 0 < w < 2. In particular
SOR(1) = GS converges.
Proof: Abbreviate R_SOR(w) = R. Recall that we write A = D - L' - U' = D*(I - L - U) and
  R = (D - w*L')^(-1) * ((1-w)*D + w*U') = (I - w*L)^(-1) * ((1-w)*I + w*U)
First we show that rho(R) >= |w-1|, so that 0 < w < 2 is required for convergence.
We compute the characteristic polynomial of R (using det(I - w*L) = 1):
  phi(lambda) = det(lambda*I - R) = det(I - w*L)*det(lambda*I - R)
              = det(lambda*(I - w*L) - (1-w)*I - w*U)
              = det((lambda + w - 1)*I - lambda*w*L - w*U)
and
  phi(0) = +- prod_i lambda_i(R) = det((w-1)*I - w*U) = (w-1)^n
so some |lambda_i(R)| must be at least |w-1|.
To show SOR(w) converges if 0 < w < 2, let M = (1/w)*D - L' and
Q = A^(-1)*(2*M - A) = 2*A^(-1)*M - I. We will
  (1) show Real(lambda_i(Q)) > 0 for all i
  (2) show R = (Q-I)*(Q+I)^(-1)
  (3) conclude that |lambda_i(R)| < 1 for all i
(1) If Q*x = lambda*x, then (2*M - A)*x = lambda*A*x and
  x^* * (2*M - A)*x = lambda * x^* * A*x
Add this last equation to its conjugate transpose and divide by 2 to get the real part
of both sides:
  x^* * (M + M^* - A)*x = Real(lambda) * x^* * A*x
Now M + M^* - A = (2/w)*D - L' - L'^* - (D - L' - L'^*) = (2/w - 1)*D, so we can solve for
  Real(lambda) = x^* * ((2/w - 1)*D)*x / (x^* * A*x) = positive/positive = positive
(2) We compute
  (Q-I)*(Q+I)^(-1) = (2*A^(-1)*M - 2*I)*(2*A^(-1)*M)^(-1) = I - M^(-1)*A
                   = I - ((1/w)*D - L')^(-1)*(D - L' - U') = R
(3) Finally, by the spectral mapping theorem (Question 4.5),
  |lambda(R)| = | (lambda(Q) - 1) / (lambda(Q) + 1) |
              = sqrt( [ (Real(lambda(Q)) - 1)^2 + Imag(lambda(Q))^2 ] /
                      [ (Real(lambda(Q)) + 1)^2 + Imag(lambda(Q))^2 ] )
              < 1,  since Real(lambda(Q)) > 0.

Now we consider Krylov Subspace Methods, of which there are many, depending on the
matrix structure. For spd matrices like our model problem, Conjugate Gradients (CG) is
the most widely used, and the one we will discuss in most detail. We will also
summarize the other methods with a "decision tree" that chooses the right algorithm
depending on properties of your matrix (like symmetry, positive definiteness, etc.).

Unlike Jacobi, GS and SOR, a Krylov subspace method for A*x=b or A*x = lambda*x need
only assume that you have a "black-box" subroutine for multiplying A times a vector x
(or A^T times x, for some methods). So you cannot ask for the diagonal part, or the
triangular parts, etc., that you would need for methods like Jacobi. This is
particularly useful in two situations:
(1) If you are writing a library for solving A*x=b, the user can simply input a pointer
to a function that computes A*z for any z, and your algorithm calls it without having
to know any details about how A is represented.
(2) A may not actually be represented as a matrix with explicit entries, but as the
result of evaluating some complicated function. For example, suppose you have a system,
like a car engine or airplane wing, governed by a complicated set of PDEs. You'd like
to optimize some set y of n outputs of the PDE as a function of some set x of n inputs,
y = f(x), say the pressure on the wing as a function of its shape. For many
optimization methods (e.g. Newton's method), this means that you need to solve linear
systems with the nxn coefficient matrix A = del f, the Jacobian of f. A is not written
down; it is defined implicitly as the derivative of the outputs of the PDE with respect
to its inputs, and varies depending on the inputs. The easiest way to access A is to
note that
  A*z = del f * z ~ ( f(x + h*z) - f(x) ) / h
when h is small enough, so multiplication by A requires solving the PDE twice, to get
f(x + h*z) and f(x) (care must be taken in choosing h small enough but not too small,
because otherwise the result is dominated by error). There is no such simple way to get
A^T * z, unfortunately.
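Here is a minimal MATLAB sketch of this matrix-free idea; the function f below is just
a hypothetical stand-in for the expensive PDE solve, and h = sqrt(eps) is one common
heuristic choice of step size:

  f  = @(x) [ x(1)^2 + sin(x(2)) ; x(1)*x(2) ];   % hypothetical stand-in for y = f(x), n = 2
  x  = [1; 2];                                    % current input
  fx = f(x);                                      % one evaluation ("PDE solve")
  h  = sqrt(eps);                                 % heuristic step size
  Az = @(z) ( f(x + h*z) - fx ) / h;              % A*z ~ (del f(x))*z, one more evaluation per z
  % compare with the exact Jacobian of this particular f at x:
  Jf = [ 2*x(1)  cos(x(2)) ; x(2)  x(1) ];
  z  = [3; -1];
  [ Az(z) , Jf*z ]                                % agree to roughly half machine precision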
There is actually software that takes an arbitrary collection of C or Fortran
subroutines computing an arbitrary function y = f(x), and produces new C or Fortran
subroutines that compute del f(x), by differentiating each line of code automatically;
google "ADIFOR" or see wiki.mcs.anl.gov/autodiff/

On the other hand, if we assume that the matrix A is available explicitly, then it is
possible to reorganize many Krylov subspace methods to significantly reduce their
communication costs, which are dominated by moving the data needed to represent the
sparse matrix A. For example, when A is too large to fit in cache (a typical
situation), each step of a conventional sequential method does a multiplication A*x,
which requires reading the matrix A from slow memory to cache. Thus the communication
cost is proportional to the number of steps, call it k. However, the reorganized Krylov
subspace methods can take k steps and only read A from slow memory once, a potential
speedup of a factor of k. In practice k is limited both by the sparsity structure of A
and by issues of numerical stability, which we will discuss briefly later.

We begin by discussing how to extract information about A from a subroutine that does
A*z. Given a starting vector y_1 (say y_1 = b for solving A*x=b), we can compute
y_2 = A*y_1, ..., y_(i+1) = A*y_i, ..., y_n = A*y_(n-1) = A^(n-1)*y_1. Letting
K = [y_1, y_2, ..., y_n] we get
  A*K = [A*y_1, ..., A*y_n] = [y_2, y_3, ..., y_n, A^n*y_1]
Assuming for a moment that K is nonsingular, write c = -K^(-1)*A^n*y_1 to get
  A*K = K*[e_2, e_3, ..., e_n, -c] = K*C
where
  C = K^(-1)*A*K = [ 0 0 ... 0 -c_1 ]
                   [ 1 0 ... 0 -c_2 ]
                   [ 0 1 ... 0 -c_3 ]
                   [       ...      ]
                   [ 0 0 ... 1 -c_n ]
is upper Hessenberg, and a companion matrix, i.e. its characteristic polynomial is
simply p(x) = x^n + sum_{i=1 to n} c_i*x^(i-1). So just by matrix-vector multiplication
we have reduced A to a simple form, and could imagine using C to solve A*x=b or to find
eigenvalues. But this would be a bad idea, for two reasons:
(1) K is likely to be dense (if y_1 is dense), and solving with K is probably harder
than solving with A.
(2) K is likely to be very ill-conditioned, since it is basically running the power
method, so the vectors y_i are converging to the eigenvectors of the largest eigenvalue
in absolute value.
Krylov subspace methods address these two drawbacks, implicitly or explicitly, as
follows. We will not compute K but rather an orthogonal Q such that the leading k
columns of K and of Q span the same space, called a Krylov subspace:
  span{y_1, y_2, ..., y_k} = span{y_1, A*y_1, ..., A^(k-1)*y_1} = {script K}_k(A,y_1)
The relationship between K and Q is simply the QR decomposition: K = Q*R. Furthermore,
we won't compute all n columns of K and Q, but just the first k << n columns, and
settle for the "best" approximate solution x of A*x=b or A*x = lambda*x that we can
find in the Krylov subspace they span (the definition of "best" depends on the
algorithm). To proceed, substitute K = Q*R into C = K^(-1)*A*K to get
C = R^(-1)*Q^T*A*Q*R, or
  (*)  R*C*R^(-1) = Q^T*A*Q = H
Since R, and so R^(-1), are triangular, and C is upper Hessenberg, H is also upper
Hessenberg. If A is symmetric, then so is H, and so H is tridiagonal.
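As a small numerical illustration of (*) (a minimal MATLAB sketch; the random symmetric
matrix and small n are arbitrary choices), we can form the Krylov matrix K, take its QR
decomposition, and check that H = Q^T*A*Q comes out tridiagonal:

  n = 6;
  A = randn(n);  A = A + A';                 % a random symmetric matrix
  y = randn(n,1);
  K = y;
  for i = 2:n, K = [K, A*K(:,end)]; end      % Krylov matrix [y, A*y, ..., A^(n-1)*y]
  [Q,R] = qr(K);
  H = Q'*A*Q;
  max(max(abs(tril(H,-2))))                  % ~ roundoff: H is upper Hessenberg, hence tridiagonal by symmetry
  cond(K)                                    % typically much larger than cond(A), illustrating drawback (2)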
To see how to compute the columns of Q one at a time, write Q = [q_1, q_2, ..., q_n],
rewrite (*) as A*Q = Q*H, and equate the j-th column on both sides to get
  A*q_j = sum_{i=1 to j+1} q_i*H(i,j)
Since the q_j are orthonormal, multiply both sides of the last equality by q_m^T to get
  q_m^T*A*q_j = sum_{i=1 to j+1} H(i,j)*q_m^T*q_i = H(m,j)   for 1 <= m <= j
and so
  H(j+1,j) * q_(j+1) = A*q_j - sum_{i=1 to j} q_i*H(i,j)
This yields the following algorithm:

Arnoldi algorithm for (partial) reduction of A to Hessenberg form:
  q_1 = b / ||b||_2
  for j = 1 to k
     z = A*q_j
     for i = 1 to j           ... run Modified Gram-Schmidt (MGS) on z
        H(i,j) = q_i^T*z
        z = z - H(i,j)*q_i
     end for
     H(j+1,j) = ||z||_2
     q_(j+1) = z / H(j+1,j)
  end for

The q_j vectors are called Arnoldi vectors, and the cost is k multiplications by A,
plus O(k^2*n) flops for MGS.

If we stop the algorithm here, at k < n, what have we learned about A? Write
Q = [Q_k, Q_u] where Q_k = [q_1,...,q_k] is computed, as is the first column q_(k+1) of
Q_u; the other columns of Q_u are unknown. Then
  H = Q^T*A*Q = [Q_k, Q_u]^T * A * [Q_k, Q_u] = [ Q_k^T*A*Q_k   Q_k^T*A*Q_u ] = [ H_k   H_uk ]
                                                [ Q_u^T*A*Q_k   Q_u^T*A*Q_u ]   [ H_ku  H_u  ]
Since H is upper Hessenberg, so are H_k and H_u, and H_ku has a single nonzero entry in
its upper right corner, namely H(k+1,k). So H_uk and H_u are unknown, and H_k and H_ku
are known.

When A is symmetric, so that H is symmetric and tridiagonal, we may write
  H = T = [ alpha_1  beta_1                              ]
          [ beta_1   alpha_2  beta_2                     ]
          [          beta_2   alpha_3  beta_3            ]
          [                     ...                      ]
          [                     beta_(n-1)  alpha_n      ]
Equating column j on both sides of A*Q = Q*T yields
  A*q_j = beta_(j-1)*q_(j-1) + alpha_j*q_j + beta_j*q_(j+1)
Multiplying both sides by q_j^T yields q_j^T*A*q_j = alpha_j, leading to the following
simpler version of the Arnoldi algorithm, called Lanczos:

Lanczos algorithm for (partial) reduction of A = A^T to tridiagonal form:
  q_1 = b / ||b||_2, beta_0 = 0, q_0 = 0
  for j = 1 to k
     z = A*q_j
     alpha_j = q_j^T*z
     z = z - alpha_j*q_j - beta_(j-1)*q_(j-1)
     beta_j = ||z||_2
     q_(j+1) = z / beta_j
  end for

Here is some notation to describe the Arnoldi and Lanczos algorithms. As above, the
space spanned by {q_1,...,q_k} is called a Krylov subspace:
  {script K}_k(b,A) = span{q_1,...,q_k},  or {script K}_k for short.
We call H_k (T_k for Lanczos) the projection of A onto {script K}_k. Our goal is to
solve A*x=b or A*x = lambda*x by somehow finding the "best" solution in the Krylov
subspace.

For the eigenvalue problem, the answer is simple: use the eigenvalues of H_k (or T_k)
as approximate eigenvalues of A. To see what the error is, suppose H_k*y = lambda*y and
write
  H * [ y ] = [ H_k   H_uk ] * [ y ] = [ H_k*y  ] = [ lambda*y         ]
      [ 0 ]   [ H_ku  H_u  ]   [ 0 ]   [ H_ku*y ]   [ H(k+1,k)*y_k*e_1 ]
and since A*Q = Q*H,
  A * (Q * [ y; 0 ]) = lambda * (Q * [ y; 0 ]) + q_(k+1)*H(k+1,k)*y_k
so if the product H(k+1,k)*y_k is small, the eigenvalue/vector pair
(lambda, y') = (lambda, Q*[y; 0]) has a small residual
  || A*y' - lambda*y' ||_2 = |H(k+1,k)*y_k|
When A is symmetric, we know from Chapter 5 that this is in fact a bound on the error
in lambda. To see how the eigenvalues of T_k approximate the eigenvalues of A as k
grows, we consider the symmetric case (Lanczos), and plot the eigenvalues of T_k versus
k (Figure 7.2: LanczosConvergence9.eps and LanczosConvergence29.eps). Chapter 7
discusses finding eigenvalues and eigenvectors in more detail.

Now we return to Chapter 6, on solving A*x=b. Given Q_k = [q_1,...,q_k], our goal is to
find the "best" approximate solution x_k in span(Q_k).
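Before defining "best", here is a minimal MATLAB sketch of the Lanczos loop above
(assuming A is symmetric and available explicitly; there is no reorthogonalization, so
in floating point the q_j gradually lose orthogonality, and a zero beta_j, which
signals an invariant subspace, is not handled):

  function [Q,T] = lanczos(A,b,k)
  % Partial reduction of the symmetric matrix A to tridiagonal form T = Q(:,1:k)'*A*Q(:,1:k)
  n = length(b);
  Q = zeros(n,k+1);  alpha = zeros(k,1);  beta = zeros(k,1);
  Q(:,1) = b / norm(b);
  for j = 1:k
     z = A*Q(:,j);                       % the only use of A: one SpMV per step
     alpha(j) = Q(:,j)'*z;
     z = z - alpha(j)*Q(:,j);
     if j > 1, z = z - beta(j-1)*Q(:,j-1); end
     beta(j) = norm(z);
     Q(:,j+1) = z / beta(j);
  end
  T = diag(alpha);
  if k > 1, T = T + diag(beta(1:k-1),1) + diag(beta(1:k-1),-1); end
  end

The eigenvalues of the small tridiagonal T computed this way typically approximate the
extreme eigenvalues of A well after relatively few steps, which is the behavior plotted
in Figure 7.2.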
To make progress, we need to define "best"; depending on what we choose, we end up with
different algorithms:
(1) Choose x_k to minimize || x_k - x ||_2, where x is the true solution.
Unfortunately, we don't have enough information in Q_k and H_k (or T_k) to compute this.
(2) Choose x_k to minimize the norm || r_k ||_2 of the residual r_k = b - A*x_k. We can
do this, and the algorithm is called MINRES (for "minimum residual") when A is
symmetric, and GMRES (for "generalized minimum residual") when it is not.
(3) Choose x_k so that r_k is perpendicular to {script K}_k, i.e. r_k^T * Q_k = 0. This
is called the orthogonal residual property, or a Galerkin condition, by analogy to
finite elements. When A is symmetric, the algorithm is called SYMMLQ. When A is
nonsymmetric, a variant of GMRES works.
(4) When A is spd, it defines a norm || r ||_{A^(-1)} = (r^T * A^(-1) * r)^(1/2). We
say the best solution minimizes || r_k ||_{A^(-1)}. Note that
  || r_k ||_{A^(-1)}^2 = r_k^T*A^(-1)*r_k = (b - A*x_k)^T*A^(-1)*(b - A*x_k)
                       = (A*x - A*x_k)^T*A^(-1)*(A*x - A*x_k)
                       = (x - x_k)^T*A*(x - x_k) = || x - x_k ||_A^2
This algorithm is called the Conjugate Gradient algorithm.

Thm: When A is spd, definitions (3) and (4) of "best" are equivalent, and using the
Conjugate Gradient algorithm it is possible to compute x_k from previous iterates for
the cost of one multiplication by A, and a small number of dot products and saxpys
(y = alpha*x + y), keeping only 3 vectors in memory.

More generally, the choice of algorithm depends on the properties of A:
Decision Tree: Figure 6.8 (IterativeTreeAx=b.ps)
See also the pointer on the class web page to "Templates for the Solution of Linear
Systems" for an expanded version of this tree, and pointers to software.

We now discuss GMRES, which uses the least structure of the matrix, and then CG, which
uses the most. In GMRES, at each step we choose x_k to minimize
|| r_k ||_2 = || b - A*x_k ||_2 where x_k = Q_k*y_k, i.e. x_k is in {script K}_k(b,A).
Thus we choose y_k to minimize (using the fact that multiplying by the orthogonal Q^T
does not change the 2-norm, and Q^T*b = e_1*||b||_2):
  || r_k ||_2 = || b - A*Q_k*y_k ||_2
              = || b - A*[Q_k,Q_u]*[ y_k; 0 ] ||_2
              = || Q^T*( b - A*[Q_k,Q_u]*[ y_k; 0 ] ) ||_2
              = || Q^T*b - H*[ y_k; 0 ] ||_2
              = || e_1*||b||_2 - [ H_k   H_uk ] * [ y_k ] ||_2
                                 [ H_ku  H_u  ]   [  0  ]
              = || e_1*||b||_2 - [ H_k  ] * y_k ||_2
                                 [ H_ku ]
Since only the first row of H_ku is nonzero, we just need to solve the (k+1)-by-k upper
Hessenberg least squares problem
  || r_k ||_2 = || e_1*||b||_2 - [ H_k              ] * y_k ||_2
                                 [ 0 ... 0 H(k+1,k) ]
which can be done inexpensively with Givens rotations, exploiting the upper Hessenberg
structure, costing just O(k^2) instead of O(k^3).

Now we return to CG, and begin by showing that when A is spd, it is "best" in two senses:
Lemma: When A is spd, (3) and (4) are equivalent:
  (3) Choose x_k so that r_k^T * Q_k = 0.
  (4) Choose x_k to minimize || r_k ||_{A^(-1)} = || x_k - x ||_A.
Both are solved by x_k = Q_k*T_k^(-1)*e_1*||b||_2, where T_k is the tridiagonal matrix
computed by Lanczos. Also r_k = +-||r_k||_2*q_(k+1).
Proof: Drop the subscript k for simpler notation, so Q = Q_k, T = T_k,
x = Q*T^(-1)*e_1*||b||_2, r = b - A*x, and T = Q^T*A*Q is nonsingular, by Question 6.1
(due today!).
Then we need to confirm that
  Q^T * r = Q^T * (b - A*x) = Q^T*b - Q^T*A*x
          = e_1*||b||_2 - Q^T*A*Q*T^(-1)*e_1*||b||_2
          = e_1*||b||_2 - T*T^(-1)*e_1*||b||_2 = 0
as desired. Now we need to show that this choice of x minimizes || r ||_{A^(-1)}^2, so
consider a different x' = x + Q*z and r' = b - A*x' = r - A*Q*z:
  || r' ||_{A^(-1)}^2 = r'^T * A^(-1) * r'
     = (r - A*Q*z)^T*A^(-1)*(r - A*Q*z)
     = r^T*A^(-1)*r - 2*(A*Q*z)^T*A^(-1)*r + (A*Q*z)^T*A^(-1)*(A*Q*z)
     = r^T*A^(-1)*r - 2*z^T*Q^T*A*A^(-1)*r + (A*Q*z)^T*A^(-1)*(A*Q*z)
     = r^T*A^(-1)*r - 2*z^T*Q^T*r + (A*Q*z)^T*A^(-1)*(A*Q*z)
     = r^T*A^(-1)*r + (A*Q*z)^T*A^(-1)*(A*Q*z)        since Q^T*r = 0
     = || r ||_{A^(-1)}^2 + || A*Q*z ||_{A^(-1)}^2
and this is minimized by choosing z = 0, so x' = x as desired.
Finally, note that r_k = b - A*x_k must be in {script K}_(k+1), so that r_k is a linear
combination of the columns of Q_(k+1) = [q_1,...,q_(k+1)]. But since r_k is
perpendicular to the columns of Q_k = [q_1,...,q_k], r_k must be parallel to q_(k+1),
so r_k = +-||r_k||_2 * q_(k+1).

There are several ways to derive CG. We will take the most "direct" way from the
formula above, deriving recurrences for 3 sets of vectors, of which we only need to
keep the most recent ones: the approximate solutions x_k, the residuals r_k, and the
so-called conjugate gradient vectors p_k.
(1) The p_k's are called gradients because each step of CG can be thought of as moving
x_(k-1) along the gradient direction p_k (i.e. x_k = x_(k-1) + nu*p_k) until it
minimizes || r_k ||_{A^(-1)} over all choices of the scalar nu.
(2) The p_k's are called conjugate, or more precisely A-conjugate, because they are
orthogonal in the A inner product: p_k^T*A*p_j = 0 if j neq k.
CG is sometimes derived by figuring out recurrences for x_k, r_k and p_k that satisfy
properties (1) and (2), and then showing they satisfy the optimality properties in the
Lemma. A nice presentation is found in Shewchuk's writeup on the class web page.
Instead, we will start with the formula x_k = Q_k*T_k^(-1)*e_1*||b||_2 from the Lemma,
and derive the recurrences from there.

Since T_k = Q_k^T*A*Q_k is spd, we can do Cholesky to get T_k = L'_k*L'_k^T, where L'_k
is lower bidiagonal, and L'_k*L'_k^T = L_k*D_k*L_k^T, where L_k has unit diagonal and
D_k is diagonal, with D_k(i,i) = L'_k(i,i)^2. Then from the Lemma
  x_k = Q_k*T_k^(-1)*e_1*||b||_2
      = Q_k*(L_k*D_k*L_k^T)^(-1)*e_1*||b||_2
      = (Q_k*L_k^(-T)) * (D_k^(-1)*L_k^(-1)*e_1*||b||_2)
      = P'_k * y_k
where we write P'_k = [p'_1,...,p'_k]. The eventual conjugate gradients p_k will turn
out to be scalar multiples of the p'_k. So we already know enough to prove property (2)
above:
Lemma: The p'_k's are A-conjugate, i.e. P'_k^T * A * P'_k is diagonal.
Proof: P'_k^T * A * P'_k = (Q_k*L_k^(-T))^T * A * (Q_k*L_k^(-T))
                         = L_k^(-1) * Q_k^T*A*Q_k * L_k^(-T)
                         = L_k^(-1) * T_k * L_k^(-T)
                         = L_k^(-1) * (L_k*D_k*L_k^T) * L_k^(-T) = D_k

Now we derive simple recurrences for the columns p'_k of P'_k and the components of
y_k, which in turn give us a recurrence for x_k. We will show that P'_(k-1) is
identical to the leading k-1 columns of P'_k: P'_k = [P'_(k-1), p'_k], and similarly
y_(k-1) = [s_1; ... ; s_(k-1)] and y_k = [s_1; ... ; s_(k-1); s_k]. Assuming these are
true for a moment, they give us the recurrence for x_k:
  (Rx)  x_k = P'_k*y_k = [P'_(k-1), p'_k]*[s_1;...;s_k]
            = P'_(k-1)*[s_1;...;s_(k-1)] + p'_k*s_k
            = x_(k-1) + p'_k*s_k
assuming we can also get recurrences for p'_k and s_k.
Since Lanczos constructs T_k row by row, so that T_(k-1) is the leading k-1 by k-1
submatrix of T_k, and Cholesky also computes row by row, we get that L_(k-1) and
D_(k-1) are the leading k-1 by k-1 submatrices of L_k and D_k, respectively:
  T_k = L_k*D_k*L_k^T = [ L_(k-1)        0 ] * [ D_(k-1)  0  ] * [ L_(k-1)        0 ]^T
                        [ 0...0 l_(k-1)  1 ]   [ 0       d_k ]   [ 0...0 l_(k-1)  1 ]
so
  L_k^(-1) = [ L_(k-1)^(-1)  0 ]
             [ stuff         1 ]
and
  y_k = D_k^(-1)*L_k^(-1)*e_1*||b||_2
      = [ D_(k-1)^(-1)  0        ] * [ L_(k-1)^(-1)  0 ] * e_1*||b||_2
        [ 0             d_k^(-1) ]   [ stuff         1 ]
      = [ D_(k-1)^(-1)*L_(k-1)^(-1)*e_1*||b||_2 ]  =  [ y_(k-1) ]
        [ s_k                                   ]     [ s_k     ]
as the desired recurrence for y_k. Similarly,
  P'_k = Q_k*L_k^(-T) = [ Q_(k-1), q_k ] * [ L_(k-1)^(-T)  stuff ]
                                           [ 0             1     ]
       = [ Q_(k-1)*L_(k-1)^(-T), p'_k ] = [ P'_(k-1), p'_k ]
so to get a formula for p'_k, write Q_k = P'_k*L_k^T; equating the last columns we get
  q_k = p'_k + p'_(k-1)*L_k(k,k-1)
or
  (Rp)  p'_k = q_k - l_(k-1)*p'_(k-1)
as the desired recurrence for p'_k. Finally we need a recurrence for r_k from (Rx):
  (Rr)  r_k = b - A*x_k = b - A*(x_(k-1) + p'_k*s_k) = r_(k-1) - A*p'_k*s_k
Putting these vector recurrences together we get
  (Rr)  r_k  = r_(k-1) - A*p'_k*s_k
  (Rx)  x_k  = x_(k-1) + p'_k*s_k
  (Rp)  p'_k = q_k - l_(k-1)*p'_(k-1)
To get closer to the final recurrences, substitute q_k = r_(k-1)/||r_(k-1)||_2 and
p_k = ||r_(k-1)||_2 * p'_k to get
  (Rr')  r_k = r_(k-1) - A*p_k*(s_k/||r_(k-1)||_2) = r_(k-1) - A*p_k*nu_k
  (Rx')  x_k = x_(k-1) + p_k*nu_k
  (Rp')  p_k = r_(k-1) - (||r_(k-1)||_2 * l_(k-1) / ||r_(k-2)||_2) * p_(k-1)
             = r_(k-1) + mu_k*p_(k-1)
We still need formulas for the scalars mu_k and nu_k. There are several choices, some
more numerically stable than others. See the text for the algebra; here we just write
  nu_k = r_(k-1)^T*r_(k-1) / (p_k^T*A*p_k)
  mu_k = r_(k-1)^T*r_(k-1) / (r_(k-2)^T*r_(k-2))
Putting it all together, we get the

Conjugate Gradient Method for solving A*x=b:
  k = 0, x_0 = 0, r_0 = b, p_1 = b
  repeat
     k = k+1
     z = A*p_k
     nu_k = r_(k-1)^T*r_(k-1) / (p_k^T*z)
     x_k = x_(k-1) + nu_k*p_k
     r_k = r_(k-1) - nu_k*z
     mu_(k+1) = r_k^T*r_k / (r_(k-1)^T*r_(k-1))
     p_(k+1) = r_k + mu_(k+1)*p_k
  until || r_k ||_2 small enough

Note that minimization property (1) above follows as well: since x_k minimizes
|| r_k ||_{A^(-1)} over all candidates in {script K}_k, it certainly also minimizes it
over the lower dimensional (one-parameter) family x_(k-1) + nu*p_k, which is exactly
property (1).
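For completeness, here is a minimal MATLAB sketch of the CG loop above, applied to the
2D model problem built with kron as before (the mesh size, the right-hand side, the
stopping tolerance and the iteration cap are arbitrary choices, and no preconditioning
is used):

  n = 31;  h = 1/(n+1);
  TN = 2*eye(n) - diag(ones(n-1,1),-1) - diag(ones(n-1,1),+1);
  A  = kron(eye(n),TN) + kron(TN,eye(n));    % T_nxn, the 2D model problem
  b  = h^2*ones(n^2,1);                      % right-hand side h^2*vec(F)
  x = zeros(n^2,1);  r = b;  p = b;
  rho = r'*r;  k = 0;
  while sqrt(rho) > 1e-10*norm(b) && k < 1000
     k = k + 1;
     z  = A*p;                               % the only use of A: one SpMV per iteration
     nu = rho / (p'*z);
     x  = x + nu*p;
     r  = r - nu*z;
     rho_new = r'*r;
     mu = rho_new / rho;
     p  = r + mu*p;
     rho = rho_new;
  end
  [ k , norm(b - A*x) ]                      % iterations taken and final residual

The iteration count grows roughly like sqrt(cond(A)) = O(n), consistent with the table
earlier in the notes.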