Notes for Math 221, Lecture 4, Sep 8 2009

Goal: Gaussian Elimination
Overall plan:
   Derive basic algorithm (twice)
   Need for pivoting to keep backward error small
   Error analysis and practical error bounds
   How to implement matmul and then GE efficiently
   Important variations exploiting structure (spd, banded, sparse, ...)
   Iterative refinement to improve an approximate answer

Def: Permutation matrix: identity with permuted rows
Facts: let P, P1, etc. be permutation matrices
   P*X = row permutation of X
   X*P = column permutation of X
   P1*P2 = P3, another permutation matrix
   inv(P) = P^T (enough to check that diag(P*P^T) = all ones)
   det(P) = +-1
   P is an orthogonal matrix
   To store and multiply by P, just keep track of the row order (cheap)

Algorithm: GE with pivoting
  1) Factor A = P*L*U where P is a permutation, L unit lower triangular,
     U upper triangular (cost = (2/3)n^3 + O(n^2) flops, the hard part)
  2) Solve P*(L*U*x) = b for L*U*x = P^T*b (permute, no flops)
  3) Solve L*(U*x) = P^T*b for U*x = inv(L)*P^T*b (substitution, cost = n^2 flops)
  4) Solve U*x = inv(L)*P^T*b for x = inv(U)*inv(L)*P^T*b (substitution, cost = n^2 flops)

Note: We do not compute inv(A) and multiply by it! That both takes about 3x
as many flops for a dense matrix and is less accurate.
Note: once we have A = P*L*U, handling each additional right-hand side b is
cheap, O(n^2).

We will derive the algorithm for step 1) in a couple of ways, and show that
they are equivalent. First we deal with the case where we don't need P
(or imagine we start with P^T*A, and compute P^T*A = L*U).

Approach 1: Start with how you may have learned it long ago:
"Take each row of the matrix, and add multiples of it to later rows to zero
out each column below the diagonal, until you get a triangular matrix U."

   for i = 1:n-1   ... for each column i, zero it out below the diagonal
      for j = i+1:n   ... for each row j below row i
         m = A(j,i)/A(i,i)   ... subtract a multiple m of row i from row j
         for k = i:n   ... for each possibly nonzero entry of row j
            A(j,k) = A(j,k) - m*A(i,k)

Let's "optimize" this:
(1) Don't bother computing the zero entries: line 4 becomes
    for k = i+1:n, so A(j,i) is left alone.
(2) Save the multipliers m in the locations that would have been zeroed out:
    line 3 becomes A(j,i) = A(j,i)/A(i,i)
    line 5 becomes A(j,k) = A(j,k) - A(j,i)*A(i,k)
(3) Split into two loops: compute all the multipliers A(j,i) first, then use them:

   for i = 1:n-1   ... for each column i, zero it out below the diagonal
      for j = i+1:n   ... for each row j below row i
         A(j,i) = A(j,i)/A(i,i)   ... compute and store the multiplier m
      for j = i+1:n   ... for each row j below row i
         for k = i+1:n   ... for each entry of row j right of column i
            A(j,k) = A(j,k) - A(j,i)*A(i,k)

(4) Use Matlab notation for the loops:

   for i = 1:n-1
      A(i+1:n, i) = A(i+1:n, i)/A(i,i)   ... scale column by 1/A(i,i)
      A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)
         ... subtract a rank-1 matrix from the trailing (n-i) x (n-i)
             submatrix (draw picture)

After running this algorithm, A has been overwritten by new entries.
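To make version (4) concrete, here is a minimal runnable Matlab sketch
(the name lu_nopivot is just illustrative; it assumes every pivot A(i,i)
encountered is nonzero, i.e. no pivoting is needed):

   function A = lu_nopivot(A)
   % Overwrite A with its LU factors: U on and above the diagonal,
   % the multipliers (the strict lower triangle of L) below it.
   % Assumes every pivot A(i,i) encountered is nonzero.
   n = size(A,1);
   for i = 1:n-1
      A(i+1:n, i) = A(i+1:n, i) / A(i,i);          % column i of L (multipliers)
      A(i+1:n, i+1:n) = A(i+1:n, i+1:n) ...
                      - A(i+1:n, i) * A(i, i+1:n); % rank-1 update of trailing block
   end

For example, F = lu_nopivot([4 3; 6 3]) gives L = tril(F,-1) + eye(2) and
U = triu(F) with L*U = [4 3; 6 3].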
Let the upper triangular part of the result be called U.
Let the strict lower triangular part of the result, containing the saved
multipliers, be called M, so L = I + M is unit lower triangular.

Thm: Assuming we don't divide by 0, A = L*U
Proof: derive the same algorithm differently.

Def: A Leading Principal Submatrix of A is any A(1:i,1:i).

Thm: A = L*U with L unit lower triangular and U upper triangular and
nonsingular exists and is unique iff all Leading Principal Submatrices
(LPSs) of A are nonsingular.

Corollary: If A is nonsingular, there exist permutations P1 and P2 such that
P1*A*P2 = L*U (only need one of P1 or P2, not both) where L is unit lower
triangular and U is nonsingular upper triangular.

The standard algorithm embodies the proof of one direction (A = L*U exists).

Notation for block matrix multiplication, to be used frequently:
A * B = C can be written

   [ A11 A12 ] * [ B11 B12 ] = [ C11 C12 ]
   [ A21 A22 ]   [ B21 B22 ]   [ C21 C22 ]

where Cij = Ai1*B1j + Ai2*B2j, true as long as the dimensions are compatible:
   #rows(Cij) = #rows(Aik) for all i,j,k
   #cols(Cij) = #cols(Bkj) for all i,j,k
   #cols(Aik) = #rows(Bkj) for all i,j,k

Write

   A = [ A11 A12 ] = [ 1   0 ] * [ U11 U12 ] = [ U11     U12           ]
       [ A21 A22 ]   [ L21 I ]   [ 0   U22 ]   [ L21*U11 L21*U12 + U22 ]

where A11 and U11 are 1 x 1, with A11 nonzero by assumption,
      A21 and L21 are (n-1) x 1,
      A12 and U12 are 1 x (n-1),
      A22 and U22 are (n-1) x (n-1).
For this "2x2 block triangular factorization" to hold we need to set
   U11 = A11
   U12 = A12
   A21 = L21*U11, or L21 = A21/A11
   A22 = L21*U12 + U22, or U22 = A22 - L21*U12
Notation: A11 is called the "pivot", U22 the "Schur complement".

We want to use induction on U22, so we need to show that all LPSs of U22 are
nonsingular (draw picture). Write the above factorization as A = Lhat*Uhat, so

   0 ~= det(A(1:i,1:i)) = det(Lhat(1:i,1:i)*Uhat(1:i,1:i))
                        = det(Lhat(1:i,1:i))*det(Uhat(1:i,1:i))
                        = 1 * U11 * det(U22(1:i-1,1:i-1))

so det(U22(1:i-1,1:i-1)) is nonzero as desired. Then by induction
U22 = L'*U' and we get

   A = [ A11 A12 ] = [ 1   0 ] * [ U11 U12   ] = [ 1   0  ] * [ U11 U12 ]
       [ A21 A22 ]   [ L21 I ]   [ 0   L'*U' ]   [ L21 L' ]   [ 0   U'  ]

     = unit lower triangular * upper triangular, as desired.

What happens if we try to write this as an algorithm? We get the same one
as before:

   for i = 1:n-1   ... for each step of the induction
      L21 = A21/A11, i.e.
         A(i+1:n,i) = A(i+1:n,i)/A(i,i)   ... overwrite L on A as before
      U22 = A22 - L21*U12, i.e.
         A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
                                           ... A turns into U
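The same induction can also be written as a recursive Matlab sketch
(lu_recursive is an illustrative name; like the proof, it assumes all LPSs
are nonsingular, so no permutations are needed):

   function [L,U] = lu_recursive(A)
   % 2x2 block triangular factorization from the proof:
   % A = [A11 A12; A21 A22] = [1 0; L21 I] * [U11 U12; 0 U22],
   % where U22 = A22 - L21*U12 is the Schur complement.
   n = size(A,1);
   if n == 1
      L = 1; U = A;
   else
      U11 = A(1,1);                     % pivot, nonzero by assumption
      U12 = A(1,2:n);
      L21 = A(2:n,1) / U11;
      S   = A(2:n,2:n) - L21*U12;       % Schur complement U22
      [Lp,Up] = lu_recursive(S);        % induction step: U22 = L'*U'
      L = [1, zeros(1,n-1); L21, Lp];
      U = [U11, U12; zeros(n-1,1), Up];
   end

Unrolling the recursion gives exactly the loop above: each level computes one
column of L and one row of U, then recurs on the Schur complement.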
What about uniqueness? L1*U1 = L2*U2 implies inv(L2)*L1 = U2*inv(U1), i.e.
unit lower triangular = upper triangular, so both are the identity, and
L1 = L2, U1 = U2.

What about the other direction (all LPSs nonsingular)? The factorization
gives an LU decomposition of each leading submatrix, and
det(A(1:i,1:i)) = det(L(1:i,1:i))*det(U(1:i,1:i)) = nonzero * nonzero
= nonzero.

Next: we need permutations when A is merely nonsingular.
Induction again: if A is nonsingular, it has a nonzero entry somewhere in its
first column (resp. row), so permute rows (resp. columns) so that the (1,1)
entry of P1*A (resp. A*P2) is nonzero (only one of P1 and P2 is necessary).
Do the 2x2 block triangular factorization as before:

   P1*A*P2 = [ A11 A12 ] = [ 1   0 ] * [ U11 U12 ]
             [ A21 A22 ]   [ L21 I ]   [ 0   U22 ]

so U11 = A11, U12 = A12, L21 = A21/A11, U22 = A22 - L21*U12.
Apply induction to U22: since det(A) = +- U11*det(U22) is nonzero,
U22 is nonsingular, so P1'*U22*P2' = L'*U', or

   P1*A*P2 = [ 1   0 ] * [ U11 U12               ]
             [ L21 I ]   [ 0   P1'^T*L'*U'*P2'^T ]

           = [ 1   0        ] * [ U11 U12      ]
             [ L21 P1'^T*L' ]   [ 0   U'*P2'^T ]

           = [ 1 0     ] * [ 1       0  ] * [ U11 U12      ]
             [ 0 P1'^T ]   [ P1'*L21 L' ]   [ 0   U'*P2'^T ]

           = [ 1 0     ] * [ 1       0  ] * [ U11 U12*P2' ] * [ 1 0     ]
             [ 0 P1'^T ]   [ P1'*L21 L' ]   [ 0   U'      ]   [ 0 P2'^T ]

or

   [ 1 0   ] * P1 * A * P2 * [ 1 0   ] = [ 1       0  ] * [ U11 U12*P2' ]
   [ 0 P1' ]                 [ 0 P2' ]   [ P1'*L21 L' ]   [ 0   U'      ]

or (perm * perm) * A * (perm * perm) = unit lower triangular * upper
triangular, as desired.

Why bother with both P1 and P2?
(1) For dense matrices, in practice, we just use P1, as described later.
(2) There are some rare cases where the errors are much smaller if we pick
    the (1,1) entry from somewhere other than the first column, to make it
    bigger.
(3) For sparse matrices, we need both P1 and P2 to keep L and U sparse and
    save a lot of work.

We need to do pivoting even when we would not divide by zero:
Suppose we run in single precision, about 7 decimal digits:

   A = [ 1e-8 1 ],  inv(A) ~ [ -1 1     ]
       [ 1    1 ]            [  1 -1e-8 ]

cond(A) ~ 2.6, really small, so we should get accurate answers.

   L = [ 1   0 ]   U = [ 1e-8 1                    ]
       [ 1e8 1 ]       [ 0    fl(1 - 1e8*1) = -1e8 ]

but

   L*U = [ 1e-8 1 ]
         [ 1    0 ]

The (2,2) entry is wrong, and we get the same answer for
A(2,2) = -1, 0, +1, ... The computed L, U are nearly independent of A(2,2):
"numerical instability". L, U are not the exact factors of a nearby problem.
Ex: Solve A*x = [1;2]; start by solving L*y = [1;2], so
   y(1) = fl(1/1) = 1
   y(2) = fl(2 - 1e8*1) = -1e8, (nearly) independently of b(2),
so we would get the same solution for many different values of b(2):
they can't all have small error!
Another way to see why this is bad: cond(L) and cond(U) are about 1e8,
much bigger than cond(A). If we permute the rows so that A(1,1) = 1, we get
full accuracy.
Intuition: we want to pick a large entry of A to be the pivot A11.

Error Analysis

We want backward stability, i.e. A + E = P*L*U with E "small", i.e.
norm(E) = O(macheps)*norm(A). We will see that we need to avoid values of
L(j,i)*U(i,k) that are much larger than the original entries of A.

Looking at the algorithm, each entry of U and L is computed essentially as a
dot product, which we analyzed in Q 1.10. Track where U(j,k) comes from
(picture): for j <= k, we compute

   U(j,k) = fl[ A(j,k) - L(j,1)*U(1,k) - L(j,2)*U(2,k) - ... - L(j,j-1)*U(j-1,k) ]

Apply the analysis of the dot product (Q 1.10) to get

   U(j,k) = (1+e)*[ A(j,k) - fl(sum_{i=1:j-1} L(j,i)*U(i,k)) ]
              where |e| <= macheps
          = (1+e)*[ A(j,k) - sum_{i=1:j-1} L(j,i)*U(i,k)*(1+d(i)) ]
              where |d(i)| <= (j-1)*macheps

Rewrite this as A(j,k) = exact dot product + (small) error:

   A(j,k) = U(j,k)/(1+e) + sum_{i=1:j-1} L(j,i)*U(i,k)*(1+d(i))
          ~ U(j,k)*(1-e) + sum_{i=1:j-1} L(j,i)*U(i,k)*(1+d(i))
          = U(j,k) + sum_{i=1:j-1} L(j,i)*U(i,k)
            - e*U(j,k) + sum_{i=1:j-1} L(j,i)*U(i,k)*d(i)
          = (L*U)(j,k) + E(j,k)

where

   |E(j,k)| = | -e*U(j,k) + sum_{i=1:j-1} L(j,i)*U(i,k)*d(i) |
           <= [ |U(j,k)| + sum_{i=1:j-1} |L(j,i)*U(i,k)| ]*(j-1)*macheps
           <= (|L|*|U|)(j,k) * n*macheps

using L(j,j) = 1 in the last step. A similar analysis applies when j > k:

   L(j,k) = fl[ ( A(j,k) - L(j,1)*U(1,k) - ... - L(j,k-1)*U(k-1,k) )/U(k,k) ]
          = (1+e1)*(1+e2)*( A(j,k) - fl(sum_{i=1:k-1} L(j,i)*U(i,k)) )/U(k,k)

Solve for A(j,k) to get

   A(j,k) = L(j,k)*U(k,k)/((1+e1)*(1+e2)) + fl( sum_{i=1:k-1} L(j,i)*U(i,k) )
          = ...
          = (L*U)(j,k) + E(j,k) where |E(j,k)| <= (|L|*|U|)(j,k) * n*macheps again

Altogether: A = L*U + E where |E| <= n*macheps*|L|*|U| componentwise.
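As a sanity check of this componentwise bound, here is a small illustrative
Matlab experiment, using the built-in lu (which performs GEPP, described
below):

   % Check |P*A - L*U| <= n*macheps*|L|*|U| componentwise on a random matrix
   n = 100;
   A = randn(n);
   [L,U,P] = lu(A);                        % GEPP: P*A ~ L*U
   E = P*A - L*U;
   bound = n*eps*(abs(L)*abs(U));
   max(max(abs(E)./max(bound, realmin)))   % ratio well below 1 in practice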
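Before quantifying how large |L|*|U| can get, the bad example above can be
replayed in Matlab single precision (macheps ~ 6e-8) using the lu_nopivot
sketch from earlier:

   A = single([1e-8 1; 1 1]);
   F = lu_nopivot(A);                  % no pivoting
   L = tril(F,-1) + eye(2,'single');
   U = triu(F);
   L*U                                 % = [1e-8 1; 1 0]: A(2,2) is lost
   P = [0 1; 1 0];                     % permute so the pivot is 1
   F = lu_nopivot(P*A);
   L = tril(F,-1) + eye(2,'single');
   U = triu(F);
   L*U                                 % = P*A to full single accuracy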
Putting together the whole solution (from Q 1.11), since there are also
errors from the triangular solves:
   Solving L*y = b    => (L+dL)*yhat = b    where |dL| <= n*macheps*|L|
   Solving U*x = yhat => (U+dU)*xhat = yhat where |dU| <= n*macheps*|U|
Combine:
   b = (L+dL)*yhat = (L+dL)*(U+dU)*xhat
     = (L*U + L*dU + dL*U + dL*dU)*xhat
     = (A - E + L*dU + dL*U + dL*dU)*xhat
     = (A + dA)*xhat
where
   |dA| <= |E| + |L*dU| + |dL*U| + |dL*dU|
        <= |E| + |L|*|dU| + |dL|*|U| + |dL|*|dU|
        <= (3*n*macheps + O(macheps^2))*|L|*|U|
        ~  3*n*macheps*|L|*|U|

Is this "backward stable"? I.e., is norm(dA) = O(macheps)*norm(A)?
We need to compare norm(|L|*|U|) with norm(A).
Def: The "pivot growth factor" is g = norm(|L|*|U|)/norm(A).
Fact: g >= 1 (for most norms).
Then ||dA|| <= 3*n*macheps*g*||A||, so we can compare xhat from
(A+dA)*xhat = b with the true solution x of A*x = b:

Thm: ||xhat - x||/||x|| <= 3*n*macheps*g*kappa(A) + O(macheps^2)
Proof: from before, we bound ||xhat - x||/||x|| <= kappa(A)*||dA||/||A||.

Whether this is satisfactory depends on g.
Bad example: A = [ 1e-8 1 ] from above:
                 [ 1    1 ]

   |L|*|U| = [ 1   0 ] * [ 1e-8 |1|    ] = [ 1e-8 1   ]
             [ 1e8 1 ]   [ 0    |-1e8| ]   [ 1    2e8 ]

and the largest entry 2e8 is much bigger than anything in A (g ~ 1e8):
that is why we got the wrong answer.

Idea: pick the pivot A11 "large", so that when we divide by it the entries
of L are small:

(1) Simplest, and standard, approach, used in most libraries:
    "partial pivoting" (GEPP): permute rows only, so that A11 is the largest
    available entry in the column. Then L21 = A21/A11 has |L21| <= 1.
    Theorem (easy): with GEPP, all |L(j,i)| <= 1 and gPP <= 2^(n-1).
    Bad news: the worst case is terrible; even for n = 24 in single
    precision, all digits can be wrong.
    Good news: it hardly ever happens (only a very small family of matrices
    where this occurs).
    Empirical observation, with some justification: gPP < n^(2/3).
    If all the entries of the matrix were "random", this would be true;
    as you perform pivoting, they seem to get more random.
(2) "Complete pivoting" (GECP): permute rows and columns so that A11 is the
    largest entry in the whole matrix.
    Theorem: gCP < n^(log(n)/4)
    Empirical: gCP < n^(1/2)
    Long-standing conjecture: gCP < n (false, but nearly true).
    More expensive, hardly used, not in most libraries.
(3) Communication-avoiding pivoting: something new; we will introduce it
    when we talk about algorithms that minimize communication.

In practice, what if GEPP is not good enough? Use QR instead (later).

Finally, we note that 3*n*macheps*g*norm(A) is an upper bound, unlikely to
be attained; see Figures 2.1 and 2.2 for plots of the actual backward error
(which we know is ||E||/||A|| = ||A*xhat - b||/(||A||*||xhat||)) on random
matrices; it is nearly always about macheps, i.e. it doesn't grow with n.

Estimating the error bound: we need to estimate norm(inv(A)), given L and U.
Goals: cheap and accurate. We can't have both (theorem of D., Diament and
Malajovich), so we settle for cheap.
Idea: use
   ||inv(A)|| = max_{||x||=1} ||inv(A)*x|| = max_{||x||<=1} ||inv(A)*x||
and do gradient ascent ("go uphill") on ||inv(A)*x|| over the convex set
||x|| <= 1. For the right choice of norm it is easy to figure out the ascent
direction, and each step requires solving A*x = b for some b, which only
costs O(n^2). It usually takes at most 5 steps or so to stop ascending, so
the total cost is O(n^2); see the text for details.

Where to find all this?
   Matlab: A\b, or [L,U,P] = lu(A), or condest
   LAPACK: xGETRF for GEPP alone, where x = S/D/C/Z
           xGESV to solve A*x = b
           xGESVX for condition estimation and more
           xGECON for condition estimation alone
   CLAPACK: similar
   ScaLAPACK: PxGESV, etc.
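To tie these pieces together, here is an illustrative Matlab session using
only the built-ins just listed (lu, condest); the quantities computed are
the growth factor g, the backward error, and the forward error bound from
the theorem above (with condest standing in for kappa(A)):

   % Growth factor, backward error, and error bound for GEPP
   n = 500;
   A = randn(n);  b = randn(n,1);
   [L,U,P] = lu(A);                            % GEPP: P*A ~ L*U
   xhat = U \ (L \ (P*b));                     % permute, then two triangular solves
   g = norm(abs(L)*abs(U), inf)/norm(A, inf)   % growth factor; empirically ~ n^(2/3)
   berr = norm(A*xhat - b, inf)/(norm(A, inf)*norm(xhat, inf))
                                               % backward error; typically ~ macheps
   ferr_bound = 3*n*eps*g*condest(A)           % forward error bound from the Thm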