Notes for Ma221 Lecture 15, Apr 21, 2022

Krylov Subspace Methods: GMRES and CG

Now we return to Chapter 6, on solving A*x=b. Given the orthogonal basis of the
Krylov subspace Q_k = [q_1,...,q_k], our goal is to find the "best" approximate
solution x_k in span(Q_k). To make progress, we need to define "best";
depending on what we choose, we end up with different algorithms:

(1) Choose x_k to minimize || x_k - x ||_2, where x is the true solution.
    Unfortunately, we don't have enough information in Q_k and
    H_k = Q_k^T*A*Q_k (or T_k when A is symmetric) to compute this.

(2) Choose x_k to minimize the norm || r_k ||_2 of the residual r_k = b - A*x_k.
    We can do this, and the algorithm is called MINRES (for "minimum residual")
    when A is symmetric, and GMRES (for "generalized minimum residual") when it
    is not.

(3) Choose x_k so that r_k is perpendicular to {script K}_k, i.e.
    r_k^T * Q_k = 0. This is called the orthogonal residual property, or a
    Galerkin condition, by analogy to finite elements. When A is symmetric,
    the algorithm is called SYMMLQ. When A is nonsymmetric, a variant of GMRES
    works.

(4) When A is spd, it defines a norm || r ||_{A^(-1)} = (r^T * A^(-1) * r)^(1/2).
    We say the best solution minimizes || r_k ||_{A^(-1)}. Note that
       || r_k ||_{A^(-1)}^2 = r_k^T*A^(-1)*r_k
                            = (b - A*x_k)^T*A^(-1)*(b - A*x_k)
                            = (A*x - A*x_k)^T*A^(-1)*(A*x - A*x_k)
                            = (x - x_k)^T*A*(x - x_k)
                            = || x - x_k ||_A^2
    This algorithm is called the Conjugate Gradient Algorithm.

Thm: When A is spd, definitions (3) and (4) of "best" are equivalent, and using
the Conjugate Gradient algorithm, it is possible to compute x_k from previous
iterates for the cost of one multiplication by A and a small number of dot
products and saxpys (y = alpha*x + y), keeping only 3 vectors in memory.

More generally, the choice of algorithm depends on the properties of A:
   Decision Tree: Figure 6.8 (IterativeTreeAx=b.ps)
See also the pointer on the class web page to the on-line book "Templates for
the Solution of Linear Systems" for an expanded version of this tree, and
pointers to software.

We now discuss GMRES, which uses the least structure of the matrix, and then
CG, which uses the most.

In GMRES, at each step we choose x_k to minimize || r_k ||_2 = || b - A*x_k ||_2
where x_k = Q_k*y_k, i.e. x_k is in {script K}_k(A,b). Here Q = [Q_k,Q_u]
denotes a square orthogonal matrix whose leading k columns are Q_k (Q_u being
the unknown remaining columns), with H = Q^T*A*Q upper Hessenberg; multiplying
by the orthogonal Q^T does not change the 2-norm, and Q^T*b = e_1*||b||_2
since q_1 = b/||b||_2. Thus we choose y_k to minimize

   || r_k ||_2 = || b - A*Q_k*y_k ||_2
               = || b - A*[Q_k,Q_u]*[ y_k; 0 ] ||_2
               = || Q^T*( b - A*[Q_k,Q_u]*[ y_k; 0 ] ) ||_2
               = || Q^T*b - H*[ y_k; 0 ] ||_2
               = || e_1*||b||_2 - [ H_k   H_uk ] * [ y_k ] ||_2
                                  [ H_ku  H_u  ]   [  0  ]
               = || e_1*||b||_2 - [ H_k  ] * y_k ||_2
                                  [ H_ku ]

Since only the first row of H_ku is nonzero, we just need to solve the
(k+1)-by-k upper Hessenberg least squares problem

   || r_k ||_2 = || e_1*||b||_2 - [       H_k       ] * y_k ||_2
                                  [ 0...0 H(k+1,k)  ]

which can be done inexpensively with k Givens rotations, exploiting the upper
Hessenberg structure, costing just O(k^2) (or only O(k) per step, if we update
the factorization from the previous step) instead of O(k^3).
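To make the GMRES least squares step concrete, here is a minimal NumPy sketch
(not from the notes; the function name and defaults are my own, and it is
unpreconditioned and unrestarted). It runs Arnoldi to build Q_(k+1) and the
(k+1)-by-k Hessenberg matrix, and, for brevity, solves the small least squares
problem with np.linalg.lstsq instead of updating a QR factorization with
Givens rotations as described above:

   import numpy as np

   def gmres_sketch(A, b, tol=1e-8, maxiter=50):
       # Minimal GMRES sketch: at step k, x_k = Q_k*y_k minimizes ||b - A*x_k||_2
       # over the Krylov subspace span(Q_k).
       n = b.shape[0]
       Q = np.zeros((n, maxiter + 1))        # orthonormal Krylov basis q_1,...,q_(k+1)
       H = np.zeros((maxiter + 1, maxiter))  # upper Hessenberg entries H(i,j)
       beta = np.linalg.norm(b)
       Q[:, 0] = b / beta
       x = np.zeros(n)
       for k in range(1, maxiter + 1):
           # Arnoldi step: orthogonalize A*q_k against q_1,...,q_k
           v = A @ Q[:, k - 1]
           for j in range(k):
               H[j, k - 1] = Q[:, j] @ v
               v = v - H[j, k - 1] * Q[:, j]
           H[k, k - 1] = np.linalg.norm(v)
           if H[k, k - 1] > 0:
               Q[:, k] = v / H[k, k - 1]
           # Solve the (k+1)-by-k Hessenberg least squares problem
           #   min_y || e_1*||b||_2 - H(1:k+1,1:k)*y ||_2
           rhs = np.zeros(k + 1)
           rhs[0] = beta
           y = np.linalg.lstsq(H[:k + 1, :k], rhs, rcond=None)[0]
           x = Q[:, :k] @ y
           if np.linalg.norm(b - A @ x) <= tol * beta:
               break
       return x

In a production code one Givens rotation per step updates the factorization of
the Hessenberg matrix, so only O(k) work is needed beyond the matrix-vector
product; a library routine such as scipy.sparse.linalg.gmres (with restarting
and preconditioning) is what one would use in practice.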
Now we return to CG, and begin by showing that when A is spd, it is "best" in
two senses:

Lemma: When A is spd, (3) and (4) are equivalent:
   (3) Choose x_k so that r_k^T * Q_k = 0.
   (4) Choose x_k to minimize || r_k ||_{A^(-1)} = || x_k - x ||_A
which are both solved by
   (*)  x_k = Q_k*T_k^(-1)*Q_k^T*b = Q_k*T_k^(-1)*e_1*||b||_2 ,
where T_k is the tridiagonal matrix computed by Lanczos. Also
r_k = +- ||r_k||_2 * q_(k+1).

Here is some intuition for (*):
   Multiplying by Q_k^T (note Q_k^T*b = e_1*||b||_2) projects b onto the
      Krylov subspace spanned by Q_k;
   T_k^(-1) is the inverse of T_k = Q_k^T*A*Q_k, the projection of A onto the
      Krylov subspace;
   Multiplying by Q_k maps back from the Krylov subspace to R^n.

Proof: Drop the subscript k for simpler notation, so Q = Q_k, T = T_k,
x = Q*T^(-1)*e_1*||b||_2, r = b - A*x, and T = Q^T*A*Q is spd (and so
nonsingular) since A is. Then we need to confirm that

   Q^T * r = Q^T*(b - A*x)
           = Q^T*b - Q^T*A*x
           = e_1*||b||_2 - Q^T*A*Q*T^(-1)*e_1*||b||_2
           = e_1*||b||_2 - T*T^(-1)*e_1*||b||_2
           = 0

as desired. Now we need to show that this choice of x minimizes
|| r ||_{A^(-1)}^2, so consider a different x' = x + Q*z and
r' = b - A*x' = r - A*Q*z:

   || r' ||_{A^(-1)}^2 = r'^T * A^(-1) * r'
      = (r - A*Q*z)^T*A^(-1)*(r - A*Q*z)
      = r^T*A^(-1)*r - 2*(A*Q*z)^T*A^(-1)*r + (A*Q*z)^T*A^(-1)*(A*Q*z)
      = r^T*A^(-1)*r - 2*z^T*Q^T*A*A^(-1)*r + (A*Q*z)^T*A^(-1)*(A*Q*z)
           ... since A = A^T
      = r^T*A^(-1)*r - 2*z^T*Q^T*r + (A*Q*z)^T*A^(-1)*(A*Q*z)
      = r^T*A^(-1)*r + (A*Q*z)^T*A^(-1)*(A*Q*z)
           ... since Q^T*r = 0
      = || r ||_{A^(-1)}^2 + || A*Q*z ||_{A^(-1)}^2

and this is minimized by choosing z = 0, so x' = x as desired.

Finally, note that r_k = b - A*x_k must lie in {script K}_(k+1), so r_k is a
linear combination of the columns of Q_(k+1) = [q_1,...,q_(k+1)]. But since
r_k is perpendicular to the columns of Q_k = [q_1,...,q_k], r_k must be
parallel to q_(k+1), so r_k = +- ||r_k||_2 * q_(k+1).

(We pause the recorded lecture here.)

There are several ways to derive CG. We will take the most "direct" way,
starting from formula (*) above and deriving recurrences for 3 sets of
vectors, of which we only need to keep the most recent ones: the solutions
x_k, the residuals r_k, and the so-called conjugate gradient vectors p_k.

(1) The p_k's are called gradients because each step of CG can be thought of
    as moving x_(k-1) along the gradient direction p_k (i.e.
    x_k = x_(k-1) + nu*p_k) until it minimizes || r_k ||_{A^(-1)} over all
    choices of the scalar nu.

(2) The p_k's are called conjugate, or more precisely A-conjugate, because
    they are orthogonal in the A inner product: p_k^T*A*p_j = 0 if j neq k.

CG is sometimes derived by figuring out recurrences for x_k, r_k and p_k that
satisfy properties (1) and (2), and then showing that they satisfy the
optimality properties in the Lemma. A nice presentation is found in Shewchuk's
writeup on the class web page. Instead, we will start with the formula
x_k = Q_k*T_k^(-1)*e_1*||b||_2 from the Lemma, and derive the recurrences from
there.

Since T_k = Q_k^T*A*Q_k is spd, we can do Cholesky to get T_k = L'_k*L'_k^T
where L'_k is lower bidiagonal, and L'_k*L'_k^T = L_k*D_k*L_k^T where L_k has
unit diagonal and D_k is diagonal, with D_k(i,i) = L'_k(i,i)^2. Then from the
Lemma

   x_k = Q_k*T_k^(-1)*e_1*||b||_2
       = Q_k*(L_k*D_k*L_k^T)^(-1)*e_1*||b||_2
       = (Q_k*L_k^(-T)) * (D_k^(-1)*L_k^(-1)*e_1*||b||_2)
       = P'_k * y_k

where we write P'_k = Q_k*L_k^(-T) = [p'_1,...,p'_k] and
y_k = D_k^(-1)*L_k^(-1)*e_1*||b||_2. The eventual conjugate gradients p_k will
turn out to be scalar multiples of the p'_k. So we know enough to prove
property (2) above:

Lemma: The p'_k's are A-conjugate, i.e. P'_k^T * A * P'_k is diagonal.
Proof:
   P'_k^T * A * P'_k = (Q_k*L_k^(-T))^T * A * (Q_k*L_k^(-T))
                     = L_k^(-1) * Q_k^T*A*Q_k * L_k^(-T)
                     = L_k^(-1) * T_k * L_k^(-T)
                     = L_k^(-1) * (L_k*D_k*L_k^T) * L_k^(-T)
                     = D_k
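As a numerical sanity check on formula (*) and the A-conjugacy Lemma (not part
of the notes; the random spd test matrix, sizes, and variable names are my
own), the following NumPy sketch runs k steps of Lanczos, forms T_k, factors
it, builds P'_k = Q_k*L_k^(-T), and verifies that Q_k^T*r_k = 0 and that
P'_k^T*A*P'_k is diagonal, up to roundoff:

   import numpy as np

   np.random.seed(0)
   n, k = 50, 10
   M = np.random.randn(n, n)
   A = M @ M.T + n * np.eye(n)        # a random spd test matrix
   b = np.random.randn(n)

   # Lanczos: orthonormal Q_k with T_k = Q_k^T*A*Q_k tridiagonal
   # (no guard against breakdown; fine for this random example)
   Q = np.zeros((n, k + 1))
   alpha = np.zeros(k)
   beta = np.zeros(k)
   Q[:, 0] = b / np.linalg.norm(b)
   for j in range(k):
       v = A @ Q[:, j]
       if j > 0:
           v = v - beta[j - 1] * Q[:, j - 1]
       alpha[j] = Q[:, j] @ v
       v = v - alpha[j] * Q[:, j]
       beta[j] = np.linalg.norm(v)
       Q[:, j + 1] = v / beta[j]
   T = np.diag(alpha) + np.diag(beta[:k - 1], 1) + np.diag(beta[:k - 1], -1)

   # x_k = Q_k*T_k^(-1)*e_1*||b||_2 satisfies the Galerkin condition Q_k^T*r_k = 0
   e1 = np.zeros(k)
   e1[0] = np.linalg.norm(b)
   x_k = Q[:, :k] @ np.linalg.solve(T, e1)
   r_k = b - A @ x_k
   print(np.linalg.norm(Q[:, :k].T @ r_k))          # ~ 0 up to roundoff

   # P'_k = Q_k*L_k^(-T) has A-conjugate columns: P'_k^T*A*P'_k is diagonal
   Lp = np.linalg.cholesky(T)                       # T_k = L'_k*L'_k^T
   L = Lp / np.diag(Lp)                             # unit-diagonal L_k
   P = Q[:, :k] @ np.linalg.inv(L).T
   G = P.T @ A @ P
   print(np.linalg.norm(G - np.diag(np.diag(G))))   # ~ 0 up to roundoff

(For larger or ill-conditioned problems the Lanczos vectors can lose
orthogonality in floating point; the recurrences derived next avoid storing
Q_k altogether.)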
Now we derive simple recurrences for the columns p'_k of P'_k and the
components of y_k, which in turn give us a recurrence for x_k. We will show
that P'_(k-1) is identical to the leading k-1 columns of P'_k, i.e.
P'_k = [P'_(k-1), p'_k], and similarly that y_(k-1) = [s_1;...;s_(k-1)]
consists of the leading k-1 components of y_k = [s_1;...;s_(k-1);s_k].
Assuming these are true for a moment, they give us the recurrence for x_k:

   (Rx)  x_k = P'_k*y_k = [P'_(k-1),p'_k]*[s_1;...;s_k]
             = P'_(k-1)*[s_1;...;s_(k-1)] + p'_k*s_k
             = x_(k-1) + p'_k*s_k

assuming we can also get recurrences for p'_k and s_k.

Since Lanczos constructs T_k row by row, so that T_(k-1) is the leading
(k-1)-by-(k-1) submatrix of T_k, and Cholesky also computes row by row, we get
that L_(k-1) and D_(k-1) are the leading (k-1)-by-(k-1) submatrices of L_k and
D_k, resp.:

   T_k = L_k * D_k * L_k^T
       = [ L_(k-1)        0 ] * [ D_(k-1)  0  ] * [ L_(k-1)        0 ]^T
         [ 0...0 l_(k-1)  1 ]   [    0     d_k]   [ 0...0 l_(k-1)  1 ]

so

   L_k^(-1) = [ L_(k-1)^(-1)  0 ]
              [ stuff         1 ]

and

   y_k = D_k^(-1)*L_k^(-1)*e_1*||b||_2
       = [ D_(k-1)^(-1)     0      ] * [ L_(k-1)^(-1)  0 ] * e_1*||b||_2
         [    0          d_k^(-1)  ]   [ stuff         1 ]
       = [ D_(k-1)^(-1) * L_(k-1)^(-1) * e_1*||b||_2 ]
         [ s_k                                       ]
       = [ y_(k-1) ]
         [ s_k     ]

as the desired recurrence for y_k. Similarly

   P'_k = Q_k * L_k^(-T)
        = [ Q_(k-1) , q_k ] * [ L_(k-1)^(-T)  stuff ]
                              [ 0             1     ]
        = [ Q_(k-1)*L_(k-1)^(-T) , p'_k ]
        = [ P'_(k-1) , p'_k ]

so to get a formula for p'_k, write Q_k = P'_k * L_k^T; equating the last
columns we get q_k = p'_k + p'_(k-1)*L_k(k,k-1), or

   (Rp)  p'_k = q_k - l_(k-1)*p'_(k-1)

as the desired recurrence for p'_k. Finally we need a recurrence for r_k,
which follows from (Rx):

   (Rr)  r_k = b - A*x_k = b - A*(x_(k-1) + p'_k*s_k) = r_(k-1) - A*p'_k*s_k

Putting these vector recurrences together we get

   (Rr)  r_k = r_(k-1) - A*p'_k*s_k
   (Rx)  x_k = x_(k-1) + p'_k*s_k
   (Rp)  p'_k = q_k - l_(k-1)*p'_(k-1)

To get closer to the final recurrences, substitute q_k = r_(k-1)/||r_(k-1)||_2
and p_k = ||r_(k-1)||_2 * p'_k to get

   (Rr')  r_k = r_(k-1) - A*p_k*(s_k/||r_(k-1)||_2) = r_(k-1) - nu_k*A*p_k ,
          where nu_k = s_k/||r_(k-1)||_2
   (Rx')  x_k = x_(k-1) + nu_k*p_k
   (Rp')  p_k = r_(k-1) - (||r_(k-1)||_2 * l_(k-1) / ||r_(k-2)||_2) * p_(k-1)
              = r_(k-1) + mu_k * p_(k-1)

We still need formulas for the scalars nu_k and mu_k. There are several
choices, some more numerically stable than others. See the text for the
algebra; we just write here

   nu_k = r_(k-1)^T*r_(k-1) / p_k^T*A*p_k
   mu_k = r_(k-1)^T*r_(k-1) / r_(k-2)^T*r_(k-2)

Putting it all together, we get the

Conjugate Gradient Method for solving A*x=b:
   k = 0, x_0 = 0, r_0 = b, p_1 = b
   repeat
      k = k+1
      z = A*p_k
      nu_k = r_(k-1)^T*r_(k-1) / p_k^T*z
      x_k = x_(k-1) + nu_k*p_k
      r_k = r_(k-1) - nu_k*z
      mu_(k+1) = r_k^T*r_k / r_(k-1)^T*r_(k-1)
      p_(k+1) = r_k + mu_(k+1)*p_k
   until || r_k ||_2 small enough

Note that minimization property (1) above follows from the optimality in the
Lemma: since x_k minimizes || r_k ||_{A^(-1)} over all x_k in span(Q_k), it
must in particular minimize it over the one-dimensional set
{ x_(k-1) + nu*p_k : nu a scalar }, which is exactly the lower dimensional
minimization in property (1).

As with most other iterative methods (besides multigrid), the convergence rate
of CG depends on the condition number cond(A):

   Thm: || r_k ||_{A^(-1)} / || r_0 ||_{A^(-1)} <= 2/(1 + 2*k/sqrt(cond(A)-1))

So when cond(A) is large, one needs to take O(sqrt(cond(A))) steps to converge.
For d-dimensional Poisson's equation on a grid with n mesh points on a side,
cond(A) is about n^2, so CG takes O(n) steps to converge, for any dimension d.
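For completeness, here is a direct NumPy transcription of the CG pseudocode
above (a minimal sketch; unpreconditioned, with a relative-residual stopping
test of my own choosing):

   import numpy as np

   def cg_sketch(A, b, tol=1e-10, maxiter=1000):
       # Conjugate Gradient for spd A, following the pseudocode above;
       # keeps only the vectors x, r, p (plus the product z = A*p).
       b = np.asarray(b, dtype=float)
       x = np.zeros_like(b)                  # x_0 = 0
       r = b.copy()                          # r_0 = b
       p = b.copy()                          # p_1 = b
       rTr = r @ r
       bnorm = np.sqrt(rTr)
       for k in range(1, maxiter + 1):
           z = A @ p                         # the one multiplication by A per step
           nu = rTr / (p @ z)                # nu_k = r_(k-1)^T*r_(k-1) / p_k^T*A*p_k
           x = x + nu * p                    # x_k = x_(k-1) + nu_k*p_k
           r = r - nu * z                    # r_k = r_(k-1) - nu_k*z
           rTr_new = r @ r
           if np.sqrt(rTr_new) <= tol * bnorm:   # until ||r_k||_2 small enough
               break
           mu = rTr_new / rTr                # mu_(k+1) = r_k^T*r_k / r_(k-1)^T*r_(k-1)
           p = r + mu * p                    # p_(k+1) = r_k + mu_(k+1)*p_k
           rTr = rTr_new
       return x

For real problems A would be applied as a sparse or matrix-free operator and
the iteration would usually be preconditioned; a library routine such as
scipy.sparse.linalg.cg is the practical choice.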