Notes for Ma221 Lecture 13, Oct 7 2010

Goals for today: Sections 3.1-3.3 (recall SVD)

Example: polynomial fitting
   Formulate as minimizing norm(A*x-b), with A(i,j) = y(i)^(j-1), where the
   (y(i),b(i)) are the sample points.
   Run polyfit31 to show results.

Standard notation: argmin_x norm(A*x-b), A m x n, m > n

Other variants (code in LAPACK):
   constrained: argmin_x norm(A*x-b) st B*x = y
      Need #rows(B) <= #cols(A) <= #rows(A) + #rows(B), so the answer is unique
      if the matrices involved are also full rank.
   weighted: argmin_{x,y} norm(y) st d = A*x + B*y
      Called weighted because if B = I it would be the usual problem; if B is
      square and invertible it is argmin_x norm(inv(B)*(d-A*x)).
   underdetermined: if #rows(A) < #cols(A), there is a whole set of solutions
      (add any z with A*z = 0 to x), so to make the solution unique we seek
      argmin_x norm(x) st A*x = b.

Algorithms (all are used in various situations):
   Normal equations: solve A^T*A*x = A^T*b
      A^T*A is spd, so solve with Cholesky.
      Fastest in the dense case (fewest flops, least communication),
      but not stable if A is ill-conditioned.
   QR decomposition A = Q*R
      Gram-Schmidt - unstable if A is ill-conditioned
      Householder - stable
      blocked Householder - also minimizes data movement
      (It is also possible to use the normal-equations approach to get Q
      explicitly, called CholeskyQR; same pros and cons as normal equations:
      fast but can be unstable.)
   SVD - the most "complete" answer: shows exactly how small changes in A and b
      can perturb the solution x; best when A is ill-conditioned.
   Conversion to a linear system - allows iterative refinement (Question 3.3)

Normal equations
   The solution is where the gradient of norm(A*x-b,2)^2 = (A*x-b)^T*(A*x-b)
   vanishes: expand
      norm(A*(x+e)-b,2)^2 = (A*x-b + A*e)^T*(A*x-b + A*e)
                          = ||A*x-b||^2 + 2*e^T*A^T*(A*x-b) + ||A*e||^2
   For the term linear in e to vanish for every direction e we need
   A^T*(A*x-b) = 0, i.e. A^T*A*x = A^T*b.
   Why is this a minimizer?
   (1) The Hessian A^T*A is positive definite (when A has full column rank).
   (2) Complete the square: when x satisfies the normal equations, the cross
       term above is zero, so norm(A*(x+e)-b,2)^2 = ||A*x-b||^2 + ||A*e||^2
       >= ||A*x-b||^2 for every e.
   (Picture, use of Pythagorean Theorem.)
   #flops: m*n^2 (to form A^T*A) + n^3/3 (Cholesky)

Gram-Schmidt A = Q*R (dimensions)
   Assuming we can compute A = Q*R, how do we solve argmin_x norm(A*x-b)?
      x = R\(Q'*b)
   Why?
   Proof 1: A = Q*R = [Q,Qhat]*[R;0] where [Q,Qhat] is square and orthogonal;
      compute || [Q,Qhat]^T*(A*x-b) ||_2.
   Proof 2: A*x - b = Q*R*x - b = Q*R*x - Q*Q'*b + Q*Q'*b - b
      = Q*(R*x - Q'*b) + (Q*Q' - I)*b, and the two terms are orthogonal.
   Proof 3: plug A = Q*R into the normal equations.

Algorithms for computing A = Q*R
   Classical and Modified Gram-Schmidt:
      for i = 1:n
         tmp = A(:,i)
         for j = 1:i-1
            r(j,i) = Q(:,j)'*A(:,i)   ... CGS version, costs 2m
            r(j,i) = Q(:,j)'*tmp      ... MGS version, costs 2m
            tmp = tmp - r(j,i)*Q(:,j) ... costs 2m
         end
         r(i,i) = norm(tmp,2)         ... costs 2m
         Q(:,i) = tmp/r(i,i)          ... costs m
      end
   Total #flops = 4*m*n^2/2 + O(m*n) = 2*m*n^2 + O(m*n), about 2x normal equations.

   Two metrics of backward stability:
   If backward stable, we should get an accurate QR decomposition of a slightly
   perturbed input matrix: A + E = Q*R with norm(E)/norm(A) = O(macheps).
   This means norm(Q*R-A)/norm(A) should be O(macheps), just as we expect
   norm(A-P*L*U)/norm(A) to be O(macheps) for Gaussian elimination.
   But it also means norm(Q'*Q-I) should be O(macheps), an extra condition.
   Plots showing instability: run QRStability with cnd = 1e1, 1e2, 1e4, 1e8,
   m = 6, n = 4, samples = 100.
   Eventually: we will explain a stable implementation; it is also a building
   block for eigenvalue and SVD algorithms.
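   A minimal Matlab sketch of both metrics (in the spirit of the QRStability
   experiments, though not the actual script; the test matrix construction here
   is just one convenient choice): both CGS and MGS keep norm(Q*R-A) small, but
   they lose orthogonality as cnd grows, CGS much faster than MGS.

      m = 6; n = 4; cnd = 1e8;
      [U,RU] = qr(randn(m)); [V,RV] = qr(randn(n));
      S = diag(logspace(0,-log10(cnd),n));     % singular values 1 down to 1/cnd
      A = U(:,1:n)*S*V';                       % test matrix with cond(A) = cnd
      Qc = zeros(m,n); Qm = zeros(m,n); Rc = zeros(n); Rm = zeros(n);
      for i = 1:n
         tc = A(:,i); tm = A(:,i);
         for j = 1:i-1
            Rc(j,i) = Qc(:,j)'*A(:,i); tc = tc - Rc(j,i)*Qc(:,j);   % CGS
            Rm(j,i) = Qm(:,j)'*tm;     tm = tm - Rm(j,i)*Qm(:,j);   % MGS
         end
         Rc(i,i) = norm(tc); Qc(:,i) = tc/Rc(i,i);
         Rm(i,i) = norm(tm); Qm(:,i) = tm/Rm(i,i);
      end
      disp([norm(Qc*Rc-A)/norm(A), norm(Qm*Rm-A)/norm(A), ...  % both O(macheps)
            norm(Qc'*Qc-eye(n)),   norm(Qm'*Qm-eye(n))])       % orthogonality lost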
Recall the SVD = Singular Value Decomposition
   Thm. Suppose A is m x m. Then there are an orthogonal matrix U = [u(1),...,u(m)],
   a diagonal matrix Sigma = diag(sigma(1),...,sigma(m)) with
   sigma(1) >= sigma(2) >= ... >= 0, and an orthogonal matrix V = [v(1),...,v(m)]
   with A = U*Sigma*V^T.
   The u(i) are called left singular vectors, the sigma(i) singular values, and
   the v(i) right singular vectors.
   More generally, if A is m x n with m > n, then
      U is m x m and orthogonal as before,
      V is n x n and orthogonal,
      Sigma is m x n with the same diagonal as before.
   When m > n, we sometimes write this as
      A = [u(1),...,u(n)] * diag(sigma(1),...,sigma(n)) * V^T = Uhat*Sigmahat*V^T.

   Recall Fact 2 from Lecture 3: when m > n, we can use the SVD to solve the
   full rank least squares problem argmin_x ||A*x-b||_2 as follows: writing
   A = Uhat*Sigmahat*V' as above, then x = V*inv(Sigmahat)*Uhat^T*b.

   Def: If A = Uhat*Sigmahat*V^T is full rank, its (Moore-Penrose) pseudoinverse
   is A^+ = V*inv(Sigmahat)*Uhat^T.
   Facts: A^+ = inv(A^T*A)*A^T (i.e. the SVD and the normal equations give the
   same answer!). In Matlab, A^+ = pinv(A) (the case when A is not full rank,
   or #cols > #rows, is defined later).

Perturbation theory for the least squares problem
   If x = argmin_x norm(A*x-b,2), and we change A and b a little, how much can
   x change? Since the square case is a special case, we expect cond(A) to show
   up, but there is another source of sensitivity:
      A = [1;0], b = [0;b2]  =>  x = inv(A^T*A)*A^T*b = 1*0 = 0
      A = [1;0], b = [b1;b2] =>  x = 1*b1 = b1
   If b2 is very large, then b1 can be small compared to ||b|| (so changing it
   is a small relative change in b) while still changing x from 0 to b1, an
   arbitrarily large relative change in x.
   More generally, if b is (nearly) orthogonal to span(A), the condition number
   is very large.
   Expand
      e = (x+e) - x = inv((A+dA)^T*(A+dA))*(A+dA)^T*(b+db) - inv(A^T*A)*A^T*b,
   using the approximation inv(I-X) = I + X + O(||X||^2) as before, keeping only
   linear terms (one factor of dA and/or db), and call
   eps = max(norm(dA)/norm(A), norm(db)/norm(b)), to get
      norm(e)/norm(x) <= eps*(2*cond(A)/cos(theta) + tan(theta)*cond(A)^2) + O(eps^2)
   where theta = angle(b,A*x), i.e. sin(theta) = norm(A*x-b,2)/norm(b,2).
   So the condition number can be large if
      (1) cond(A) is large,
      (2) theta is near pi/2, i.e. 1/cos(theta) ~ tan(theta) is large;
      (3) note the error behaves like cond(A)^2 whenever theta is not near zero.
   It will turn out that using the right QR decomposition, or the SVD, keeps
   eps = O(machine epsilon), i.e. these methods are backward stable.
   The normal equations are not backward stable: since we do Chol(A^T*A), the
   error bound is always proportional to cond(A)^2, even when theta is small.
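   A minimal Matlab sketch of this last contrast (the test problem is just one
   convenient choice): an ill-conditioned but consistent problem, so theta ~ 0
   and a backward stable method loses only about cond(A)*macheps in relative
   error, while the normal equations lose about cond(A)^2*macheps.

      m = 50; n = 10; cnd = 1e7;
      [U,RU] = qr(randn(m)); [V,RV] = qr(randn(n));
      A = U(:,1:n)*diag(logspace(0,-log10(cnd),n))*V';   % cond(A) = cnd
      x_true = randn(n,1);
      b = A*x_true;                      % consistent: residual ~ 0, theta ~ 0
      x_qr = A\b;                        % Householder QR (see below)
      x_ne = (A'*A)\(A'*b);              % normal equations
      disp([norm(x_qr-x_true)/norm(x_true), norm(x_ne-x_true)/norm(x_true)])
      % typically ~ cnd*macheps for QR, ~ cnd^2*macheps for normal equations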
Stable algorithms for QR decomposition
   Recall that MGS and CGS produced Qs that were far from orthogonal, even though
   norm(A-Q*R) was as small as desired. For solving least squares problems (and
   later for eigenproblems and computing the SVD) we need algorithms that keep Q
   very nearly orthogonal, i.e. norm(Q^T*Q-I) = O(macheps).
   Note: MGS still has its uses in some other algorithms, CGS does not.
   Basic idea: express Q as a product of simple orthogonal matrices
   Q = Q(1)*Q(2)*..., where each Q(i) accomplishes part of the task (multiplying
   by it makes some entries of A zero, until A turns into R). Since a product of
   orthogonal matrices is orthogonal, there is no danger of losing orthogonality.
   There are two kinds of simple orthogonal matrices: Householder transformations
   and Givens rotations.

   A Householder transformation (or reflection) is of the form H = I - 2*u*u^T,
   where u is a unit vector. We confirm orthogonality as follows:
      H*H^T = (I-2*u*u^T)*(I-2*u*u^T) = I - 4*u*u^T + 4*u*u^T*u*u^T = I
   It is called a reflection because H*x is the reflection of x in the plane
   orthogonal to u (picture).
   Given x, we want to find u so that H*x has zeros in particular locations; in
   particular we want u so that H*x = [c,0,0,...,0]^T = c*e1 for some constant c.
   Since the 2-norm of both sides is ||x||_2, we need |c| = ||x||_2.
   We solve for u as follows: write
      H*x = (I-2*u*u^T)*x = x - 2*u*(u^T*x) = c*e1, or u = (x - c*e1)/(2*u^T*x).
   The denominator is just a constant that we can choose so that ||u||_2 = 1,
   i.e. y = x +- ||x||_2*e1 and u = y/||y||_2. We write this as u = House(x).
   We choose the sign of +- to avoid cancellation (this avoids some numerical
   problems):
      y = [ x(1) + sign(x(1))*||x||_2 , x(2), ..., x(n) ]
   (picture of the two choices).

   How to do QR decomposition by forming Q as a product of Householder
   transformations (picture of a 5 x 4 example):

      QR decomposition of m x n matrix A, m >= n
      for i = 1 to min(m-1,n)    ... only need to do the last column if m > n
         u(i) = House(A(i:m,i))  ... compute the Householder vector
         A(i:m,i:n) = (I - 2*u(i)*u(i)^T)*A(i:m,i:n)
                    = A(i:m,i:n) - u(i)*(2*u(i)^T*A(i:m,i:n))
      end for

   Note: we never actually form each Householder matrix Q(i) = I - 2*u(i)*u(i)^T,
   or multiply them together to get Q; we only multiply by them. We only need to
   store the u(i), which are kept in A in place of the data they are used to zero
   out (just like the entries of L in Gaussian elimination). (picture)
   The cost is sum_{i=1 to n} 4*(m-i+1)*(n-i+1) = 2*n^2*m - (2/3)*n^3,
   which is (4/3)*n^3 when m = n, twice the cost of GE.
   So when we are done we get Q(n)*Q(n-1)*...*Q(1)*A = R, or
      A = Q(1)^T*...*Q(n)^T*R = Q(1)*...*Q(n)*R = Q*R
   (each Q(i) is symmetric), and to solve the least squares problem
   argmin_x ||A*x-b||_2 we do x = R \ (Q^T*b), or equivalently, without ever
   forming Q:
      for i = 1 to n
         c = -2*u(i)^T*b(i:m)       ... dot product
         b(i:m) = b(i:m) + c*u(i)   ... "saxpy"
      end for
      x = R \ b(1:n)
   which costs O(m*n), much less than computing A = Q*R (again analogous to GE).
   In Matlab, x = A\b does GEPP if A is square, and solves the least squares
   problem using Householder QR when A has more rows than columns.

   Optimizing QR decomposition: we can (and should!) use all the ideas of
   Chapter 2 to minimize how much data is moved by QR decomposition. We have only
   described the BLAS2 version (analogous to the simplest version of LU). We will
   not do the details of how to reorganize the algorithm to do lots of matmuls
   and reduce the amount of data moved to O(#flops / sqrt(memory_size))
   (see Question 3.17), but this can be done.
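   A minimal Matlab sketch of the unblocked (BLAS2-style) algorithm above,
   applied directly to a least squares problem; for clarity each u(i) is kept in
   its own column of a separate array U rather than overwritten onto the zeroed
   part of A.

      m = 6; n = 4;
      A = randn(m,n); b = randn(m,1);
      A0 = A; b0 = b;                       % keep copies to check the answer
      U = zeros(m,n);
      for i = 1:min(m-1,n)
         y = A(i:m,i);
         s = sign(y(1)); if s == 0, s = 1; end
         y(1) = y(1) + s*norm(y);           % choose the sign to avoid cancellation
         u = y/norm(y);                     % u = House(A(i:m,i))
         U(i:m,i) = u;
         A(i:m,i:n) = A(i:m,i:n) - u*(2*(u'*A(i:m,i:n)));
         b(i:m)     = b(i:m)     - u*(2*(u'*b(i:m)));   % apply Q(i) to b as we go
      end
      R = triu(A(1:n,1:n));
      x = R \ b(1:n);
      disp(norm(x - A0\b0))                 % agrees with Matlab's own QR solve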
   There is one other way to represent Q as a product of simple orthogonal
   transformations, called Givens rotations. They are useful in other situations
   besides least squares problems (for which we use Householder), so we present
   them here. A Givens rotation
      R(theta) = [ cos(theta)  -sin(theta) ]
                 [ sin(theta)   cos(theta) ]
   rotates x counterclockwise by theta (picture).
   More generally, we will take components i and j of a vector and rotate them:
   R(i,j,theta) = the identity, except for rows and columns i and j, which
   contain the entries of R(theta) (picture).
   To do QR, we want to create zero entries. So given x(i) and x(j), we want to
   pick theta so that R(i,j,theta)*x is 0 in entry j:
      [ cos(theta)  -sin(theta) ] * [ x(i) ] = [ sqrt(x(i)^2+x(j)^2) ]
      [ sin(theta)   cos(theta) ]   [ x(j) ]   [          0          ]
   or cos(theta) = x(i)/sqrt(x(i)^2+x(j)^2), sin(theta) = -x(j)/sqrt(x(i)^2+x(j)^2).
   We do QR by repeatedly multiplying A by R(i,j,theta) matrices to introduce new
   zeros, until A becomes upper triangular (picture).

Stability of applying orthogonal matrices
   By using the basic rule fl(a op b) = (a op b)*(1+delta), |delta| <= macheps,
   one can show the following for multiplying by one Householder or Givens
   transformation Q. Let Q' be the floating point transformation actually
   represented in the machine, e.g. Q' = I - 2*u'*u'^T where u' is the computed
   Householder vector; as can be seen from the formula above, every component
   u'(i) is nearly equal to the exact value u(i), but roundoff keeps it from
   being perfect. Then
      fl( Q'*A ) = Q'*A + E   where norm(E) = O(macheps)*norm(A).
   This follows from our earlier analysis of dot products, etc. Since Q' is
   nearly equal to the exact orthogonal Q, say Q' = Q + F with
   norm(F) = O(macheps), we can also write this as
      fl( Q'*A ) = (Q+F)*A + E = Q*A + (F*A + E) = Q*A + G
   where norm(G) <= norm(F)*norm(A) + norm(E) = O(macheps)*norm(A),
   i.e. you get the exact orthogonal transformation Q*A plus a small error.
   The same result applies to Givens rotations and to block Householder
   transformations (used to reduce communication). We can also write this as
      fl( Q'*A ) = Q*(A + Q^T*G),   with norm(Q^T*G) = norm(G) = O(macheps)*norm(A),
   i.e. multiplying by Q is backward stable: we get the exact orthogonal
   transformation of a slightly different matrix.
   Now multiply by a sequence of orthogonal matrices, as in QR decomposition:
      fl(Q3'*(Q2'*(Q1'*A))) = fl(Q3'*(Q2'*(Q1*A + E1)))
                            = fl(Q3'*(Q2*(Q1*A + E1) + E2))
                            = Q3*(Q2*(Q1*A + E1) + E2) + E3
                            = Q3*Q2*Q1*A + Q3*Q2*E1 + Q3*E2 + E3
                            = Q3*Q2*Q1*A + E
                            = Q*A + E = Q*(A + Q^T*E) = Q*(A + F)
   where ||F|| = ||E|| <= ||E1|| + ||E2|| + ||E3|| = O(macheps)*norm(A),
   so multiplying by many orthogonal matrices is also backward stable.
   This analysis explains not just why QR is stable (and so why
   norm(Q^T*Q-I) = O(macheps), unlike Gram-Schmidt), but will also explain why
   our eigenvalue algorithms are stable, since they (mostly) just multiply by
   orthogonal matrices.
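   A minimal Matlab sketch of QR by Givens rotations, using the cos/sin formulas
   above, together with a check of the two stability metrics (the accumulated Q
   stays orthogonal to roundoff, as the analysis above predicts). Forming each
   rotation as a full m x m matrix is done here only for clarity; a real
   implementation would update just the two affected rows.

      m = 6; n = 4;
      A = randn(m,n); R = A; Q = eye(m);
      for j = 1:n
         for i = m:-1:j+1                   % zero out R(i,j) from the bottom up
            xi = R(i-1,j); xj = R(i,j);
            r = hypot(xi,xj);
            if r == 0, continue, end
            c = xi/r; s = -xj/r;            % cos(theta), sin(theta) as above
            G = eye(m); G([i-1 i],[i-1 i]) = [c -s; s c];
            R = G*R;                        % apply R(i-1,i,theta)
            Q = Q*G';                       % accumulate Q so that A = Q*R
         end
      end
      disp([norm(Q'*Q - eye(m)), norm(Q*R - A)/norm(A)])   % both O(macheps)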
Solving a least squares problem when A is rank deficient
   How do we minimize ||A*x-b||_2 when A is rank deficient?
   Thm: Let A be m x n with m >= n, but rank r < n. Let
      A = U*Sigma*V^T = [U1,U2,U3] * [ Sigma1    0    ] * [V1,V2]^T
                                     [   0     Sigma2 ]
                                     [   0       0    ]
   be the SVD of A, where
      U is m x m, U1 is m x r, U2 is m x (n-r), U3 is m x (m-n),
      Sigma is m x n, Sigma1 is r x r and nonsingular, Sigma2 is (n-r) x (n-r)
         and exactly 0,
      V is n x n, V1 is n x r, V2 is n x (n-r).
   Then the set of vectors x minimizing ||A*x-b||_2 can be written as
      { x = V1*Sigma1^(-1)*U1^T*b + V2*y : y any vector of dimension n-r }.
   The unique vector minimizing both ||A*x-b||_2 and ||x||_2 is
      x = V1*Sigma1^(-1)*U1^T*b, i.e. the choice y = 0.

   Def: A^+ = V1*Sigma1^(-1)*U1^T is the Moore-Penrose pseudoinverse of the
   rank-r matrix A (this includes the full rank case r = n). So whether A is
   square or not, full rank or not, the "best" solution minimizing ||A*x-b||_2
   is x = A^+*b.

   Proof:
      || A*x-b ||_2^2 = || U*Sigma*V^T*x - b ||_2^2 = || Sigma*V^T*x - U^T*b ||_2^2
                      = || [ Sigma1*V1^T*x - U1^T*b ] ||_2^2
                        || [        - U2^T*b        ] ||
                        || [        - U3^T*b        ] ||
                      = || Sigma1*V1^T*x - U1^T*b ||_2^2 + || U2^T*b ||_2^2
                        + || U3^T*b ||_2^2
   Now write x = V*V^T*x = [V1,V2]*[V1^T;V2^T]*x = [V1,V2]*[V1^T*x;V2^T*x]
   = [V1,V2]*[y1;y2] = V1*y1 + V2*y2. Note that V2*y2 is in the null space of A.
   Substituting for x, and using V1^T*V1 = I and V1^T*V2 = 0, we get
      || A*x-b ||_2^2 = || Sigma1*V1^T*(V1*y1 + V2*y2) - U1^T*b ||_2^2
                        + || U2^T*b ||_2^2 + || U3^T*b ||_2^2
                      = || Sigma1*y1 - U1^T*b ||_2^2 + || U2^T*b ||_2^2 + || U3^T*b ||_2^2
   which is clearly minimized by y1 = Sigma1^(-1)*U1^T*b and any y2, i.e.
      x = V1*y1 + V2*y2 = V1*Sigma1^(-1)*U1^T*b + V2*y2 for any y2.
   Since ||x||_2^2 = || [y1;y2] ||_2^2 = ||y1||_2^2 + ||y2||_2^2, ||x||_2 is
   minimized by the choice y2 = 0, as claimed. QED

Solving a least squares problem when A is (nearly) rank deficient, with the SVD
   We defined the condition number of a matrix as sigma_max/sigma_min, where
   sigma_min = 0 for a rank deficient matrix, so the condition number is formally
   infinite. To illustrate this, compare the (minimum norm) least squares
   solutions of
      argmin || [ 1 0 ] * [ x1 ] - [ 1 ] ||_2   with solution x = [ 1 ]
             || [ 0 0 ]   [ x2 ]   [ 1 ] ||                       [ 0 ]
             || [ 0 0 ]            [ 1 ] ||
   and
      argmin || [ 1 0 ] * [ x1 ] - [ 1 ] ||_2   with solution x = [  1  ]
             || [ 0 e ]   [ x2 ]   [ 1 ] ||                       [ 1/e ]
             || [ 0 0 ]            [ 1 ] ||
   which are arbitrarily different as e approaches 0.
   Given this sensitivity, it is natural to ask what sense it can make to solve a
   rank deficient least squares problem, when the solution can change
   discontinuously. In particular, we usually only know A to within some
   tolerance tol, i.e. the true matrix A' could satisfy || A - A' || <= tol.
   When A is nearly singular, what should we do?

   Def: Given A and some tolerance tol bounding the uncertainty in A, the
   "truncated SVD" of A is A_tol = U*Sigma(tol)*V^T, where A = U*Sigma*V^T is
   the usual SVD and Sigma(tol) = diag(sigma_i(tol)) with
      sigma_i(tol) = { sigma_i  if sigma_i >= tol
                     { 0        if sigma_i <  tol
   In other words, we "truncate" all the singular values smaller than tol to
   zero. Equivalently, we replace A by the lowest rank matrix within a distance
   tol (measured by the 2-norm) of A.
   Using the truncated SVD effectively replaces the usual condition number
   sigma_max(A)/sigma_min(A) by sigma_max(A)/tol, which can be much smaller,
   depending on the choice of tol. In other words, tol is a "knob" one can turn
   to trade off conditioning against how closely A*x can approximate b (raising
   tol, and so lowering the rank of A_tol, decreases the dimension of the space
   that can be approximated by A_tol*x). Replacing A by an "easier" matrix is
   also called "regularization" (the truncated SVD is just one of several
   mechanisms).

   Lemma: The difference between x1 = argmin_x ||A_tol*x - b1||_2 and
   x2 = argmin_x ||A_tol*x - b2||_2 (where we choose the solution of smallest
   norm in both cases) is bounded by ||b1-b2||_2 / tol. So choosing a larger tol
   limits the sensitivity of the solution.
   Proof: Let A = U*Sigma*V^T and A_tol = U*Sigma(tol)*V^T as above, and let
   sigma(k) be the smallest singular value >= tol. Then
      || x1-x2 ||_2 = || (A_tol)^+ * (b1-b2) ||_2 = || V*(Sigma(tol))^+*U^T*(b1-b2) ||_2
                    = || diag(1/sigma(1),...,1/sigma(k),0,...,0)*U^T*(b1-b2) ||_2
                    <= (1/sigma(k)) * || U^T*(b1-b2) ||_2
                    <= (1/tol) * || b1-b2 ||_2
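   A minimal Matlab sketch of least squares via the truncated SVD (the nearly
   rank deficient test matrix and the value of tol are just illustrative
   choices): the truncated solution has modest norm, while the untruncated one
   is dominated by the reciprocals of the tiny singular values.

      m = 8; n = 5; tol = 1e-8;
      A = randn(m,n)*diag([1 1 1 1e-12 1e-12]);   % nearly rank 3
      b = randn(m,1);
      [U,S,V] = svd(A,0);
      s = diag(S);
      k = sum(s >= tol);                          % keep singular values >= tol
      x_trunc = V(:,1:k) * ((U(:,1:k)'*b) ./ s(1:k));  % x = V1*inv(Sigma1)*U1^T*b
      x_full  = V * (S \ (U'*b));                 % no truncation
      disp([norm(x_trunc), norm(x_full)])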
Solving a least squares problem when A is (nearly) rank deficient, with QR
   As before, the SVD makes everything easy, but it is expensive (several times
   the cost of QR, depending on your computer), so a cheaper alternative to the
   truncated SVD is the QR decomposition with column pivoting.
   Suppose we did A = Q*R exactly, with A of rank r < n; what would R look like?
   If the leading r columns of A are full rank (true for "most" such A), then
      R = [ R11  R12 ]   with R11 r x r and full rank, so R22 = 0.
          [  0   R22 ]
   If A is nearly low rank, we can hope for || R22 || < tol, and set it to zero.
   Assuming for a moment that this works, write
      A = Q*R = [Q,Q'] * [ R ]   with [Q,Q'] = [Q1,Q2,Q'] square and orthogonal,
                         [ 0 ]
   as before, with Q1 m x r, Q2 m x (n-r) and Q' m x (m-n). Thus
      argmin_x || A*x-b ||_2 = argmin_x || [Q1,Q2,Q'] * [ R ] * x - b ||_2
                                                        [ 0 ]
         = argmin_x || [ R11 R12 ] * [ x1 ] - [ Q1^T*b ] ||_2
                    || [  0   0  ]   [ x2 ]   [ Q2^T*b ] ||
                    || [  0   0  ]            [ Q'^T*b ] ||
         = argmin_x || [ R11*x1 + R12*x2 - Q1^T*b ] ||_2
                    || [           - Q2^T*b       ] ||
                    || [           - Q'^T*b       ] ||
   with solution x1 = inv(R11)*(Q1^T*b - R12*x2) for any x2.
   But how do we pick x2 to minimize || x ||_2? And is the solution as smooth as
   desired, i.e. could cond(R11) be >> 1/tol?

   Ex: A = .5*eye(n) - diag(ones(n-1,1),1) is already upper triangular, so
   A = Q*R = R. But cond(A) ~ 2^n (since (inv(A))(i,j) = 2^(j-i+1) on and above
   the diagonal), so at least one singular value is ~ 2^(-n) and should be set
   to zero when n is large. But if we reorder the columns of A as
   Ap = A(:,[2:n,1]) and then do QR, we get Ap = Q*R with R22 = R(n,n) ~ 2^(-n),
   the other R(i,i) = O(1), R11 = R(1:n-1,1:n-1) with cond(R11) = O(1), and
   R12 = R(1:n-1,n) with norm(R12) = O(1). So choosing x2 = 0 and
   x1 = inv(R11)*Q1^T*b gives a solution with small norm, as desired.
   (matlab demo; see also the sketch below)

   How do we choose the permutation in general? The standard algorithm is called
   column pivoting: at each step, look at the remaining columns not yet reduced
   to upper triangular form, and pick the one with the biggest norm.
   It is available in Matlab as [Q,R,P] = qr(A) (the permutation is returned
   in P), which uses a LAPACK routine (dgeqp3).
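   A minimal Matlab sketch in the spirit of that demo (n = 30 keeps 2^(-n) well
   above roundoff): without reordering, every |R(i,i)| is 0.5 and nothing looks
   small; with the reordering above, |R(n,n)| ~ 2^(-n) exposes the tiny singular
   value; the column-pivoted QR [Q2,R2,P] = qr(A) is included for comparison.

      n = 30;
      A  = .5*eye(n) - diag(ones(n-1,1),1);
      Ap = A(:,[2:n,1]);                 % the reordering suggested above
      [Q0,R0]   = qr(A);                 % no pivoting: R0 = A up to signs
      [Q1,R1]   = qr(Ap);                % |R1(n,n)| ~ 2^(-n)
      [Q2,R2,P] = qr(A);                 % column pivoting: A*P = Q2*R2
      disp([min(abs(diag(R0))), min(abs(diag(R1))), ...
            min(abs(diag(R2))), min(svd(A))])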